Streaming

Set "stream": true on any gateway request. The response is a server-sent events stream in the ingress format’s native protocol — the same events the format’s official SDKs already parse.
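As a minimal sketch of what such a request body might look like for the Chat Completions ingress, the snippet below builds the JSON with streaming enabled. The model name and the helper `build_streaming_body` are hypothetical placeholders, and the endpoint URL and authentication are left out because this page does not state them.

```python
import json


def build_streaming_body(model: str, prompt: str) -> str:
    """Serialize a Chat Completions-style request body with streaming on."""
    body = {
        "model": model,  # hypothetical model identifier, not a real one
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,  # serialized as the JSON literal true
    }
    return json.dumps(body)


# "example-model" is a placeholder; substitute a model your gateway serves.
payload = build_streaming_body("example-model", "Say hello")
```

The only streaming-specific part is the `stream` flag; the rest of the body is whatever the ingress format normally expects.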

Per-format protocols

The Chat Completions ingress streams chat.completion.chunk objects, terminated by data: [DONE]:

data: {"id":"gen-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}

data: {"id":"gen-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"}}]}

data: {"id":"gen-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: {"id":"gen-…","object":"chat.completion.chunk","choices":[],"usage":{"prompt_tokens":12,"completion_tokens":7,"total_tokens":19}}

data: [DONE]
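If you are not using an SDK, a stream like the one above can be consumed by hand. The following is a rough sketch, not the gateway's reference client: `collect_text` is a hypothetical helper, it assumes each event arrives as a single `data:` line, and the sample events omit the `id` field for brevity.

```python
import json
from typing import Iterable, Optional, Tuple


def collect_text(lines: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Join the content deltas of a chunk stream and report the finish reason."""
    parts = []
    finish_reason = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, SSE comments, and other fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        # The usage chunk has an empty choices list, so this loop skips it.
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            parts.append(delta.get("content") or "")
            finish_reason = choice.get("finish_reason") or finish_reason
    return "".join(parts), finish_reason


# Sample events modeled on the stream above (ids omitted).
SAMPLE = [
    'data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}',
    "",
    'data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"}}]}',
    "",
    'data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "",
    'data: {"object":"chat.completion.chunk","choices":[],"usage":{"prompt_tokens":12,"completion_tokens":7,"total_tokens":19}}',
    "",
    "data: [DONE]",
]

text, reason = collect_text(SAMPLE)
```

In a real client the `lines` argument would come from the HTTP response body, decoded line by line.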

Usage in streams — always

Token usage is reported on every request, streamed or not. For streaming, the final usage chunk is always emitted — you do not need to send stream_options: {"include_usage": true} (Chat Completions ingress accepts it for compatibility; the behavior is always on). Usage includes prompt, completion, and reasoning tokens, with cached vs. uncached prompt tokens broken out — see Usage & Credits.
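Because the usage chunk is always present, a client can read it without any opt-in flag. The sketch below assumes the Chat Completions chunk shape shown in the sample stream, where usage arrives in a chunk with an empty `choices` list. `final_usage` is a hypothetical helper, and it only touches the three fields that appear in the sample; field names for reasoning or cached tokens are not given on this page, so they are not read here.

```python
import json
from typing import Iterable, Optional


def final_usage(lines: Iterable[str]) -> Optional[dict]:
    """Return the last usage object seen before the [DONE] sentinel."""
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # ignore blank lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        if chunk.get("usage"):
            usage = chunk["usage"]
    return usage


# Tail of the sample stream: the final content chunk, then the usage chunk.
STREAM_TAIL = [
    'data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "",
    'data: {"object":"chat.completion.chunk","choices":[],"usage":{"prompt_tokens":12,"completion_tokens":7,"total_tokens":19}}',
    "",
    "data: [DONE]",
]

usage = final_usage(STREAM_TAIL)
```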

Keep-alives, timeouts, aborts

  • The gateway sends SSE keep-alive comments every 15 seconds, so load balancers and proxies never idle-close a healthy stream.
  • Timeouts: up to 5 minutes to the first token, 120 seconds of inter-chunk idle, and a 60-minute absolute stream cap. Details in Limits & Timeouts.
  • If your client aborts mid-stream, the tokens generated up to that point are still metered, billed, and recorded — the partial usage settles when the stream closes.
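On the client side, keep-alive comments and data chunks answer different questions: any line shows the connection is alive, while only `data:` lines show the model is still producing output. The sketch below tracks both, relying on the standard SSE rule that comment lines begin with a colon. `StreamClock`, `classify_sse_line`, and the 45-second stall threshold (three missed keep-alives) are illustrative assumptions, not gateway settings.

```python
KEEP_ALIVE_SECONDS = 15        # keep-alive interval stated above
STALL_AFTER_SECONDS = 45.0     # assumption: three missed keep-alives


def classify_sse_line(line: str) -> str:
    """Label one SSE line as a comment, a data field, a blank, or other."""
    if line.startswith(":"):
        return "comment"  # SSE comments, including keep-alives
    if line.startswith("data:"):
        return "data"
    if line == "":
        return "blank"    # event separator
    return "other"


class StreamClock:
    """Tracks when the stream last showed any activity and last carried data."""

    def __init__(self, now: float) -> None:
        self.last_byte = now
        self.last_data = now

    def observe(self, line: str, now: float) -> None:
        self.last_byte = now  # any line proves the connection is alive
        if classify_sse_line(line) == "data":
            self.last_data = now

    def data_idle(self, now: float) -> float:
        """Seconds since the last data chunk."""
        return now - self.last_data

    def connection_stalled(self, now: float,
                           limit: float = STALL_AFTER_SECONDS) -> bool:
        """True when nothing at all, not even a keep-alive, arrived in time."""
        return now - self.last_byte > limit


# Timestamps are passed in explicitly so the logic is easy to test.
clock = StreamClock(now=0.0)
clock.observe(": keep-alive", now=15.0)
clock.observe('data: {"choices":[]}', now=30.0)
```

In production the timestamps would come from `time.monotonic()`, and a stalled connection would typically lead the client to close the stream, keeping in mind that tokens generated before the abort are still billed.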

Errors that occur before any byte has streamed return a normal HTTP error response in your ingress format’s error shape. We never retry mid-stream.