Streaming

You can request streaming responses so tokens are sent as they are generated instead of waiting for the full reply.

Enabling streaming

Set stream: true in the request body for chat completions:
{
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "messages": [{ "role": "user", "content": "Explain recursion." }],
  "stream": true,
  "max_tokens": 500
}
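
For example, a streaming request with fetch might look like the sketch below. The endpoint URL and API key variable are placeholders, not values from this documentation; substitute your provider's own.

// Sketch of a streaming chat completions request.
// The endpoint and auth header are assumptions; adjust them for your provider.
const response = await fetch("https://api.example.com/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.API_KEY}`,
  },
  body: JSON.stringify({
    model: "meta-llama/Llama-3.3-70B-Instruct",
    messages: [{ role: "user", content: "Explain recursion." }],
    stream: true,
    max_tokens: 500,
  }),
});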

Response format

The response is a stream of server-sent events (SSE). Each event is a line starting with data: followed by JSON.
  • Content events — Include partial content in the delta (e.g. choices[0].delta.content).
  • End event — Sent as data: [DONE] when the stream is finished.
Example (conceptual):
data: {"choices":[{"delta":{"content":"Re"}}]}
data: {"choices":[{"delta":{"content":"cursion"}}]}
...
data: [DONE]

Consuming the stream

  1. Use an HTTP client that supports streaming (e.g. fetch with response.body, or a library that handles SSE).
  2. Parse each line: strip the data: prefix and parse the JSON (or handle [DONE]).
  3. Append delta.content to build the full reply.
  4. When you see [DONE], close the stream.
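
A minimal sketch of these steps, assuming the streaming response from the request example above and the event shape from the conceptual example (field names may differ by provider):

// Read the SSE stream, parse each data: line, and accumulate delta.content.
if (!response.body) throw new Error("response has no body");
const reader = response.body.getReader();
const decoder = new TextDecoder();

let buffer = "";
let fullReply = "";
let finished = false;

while (!finished) {
  const { value, done } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });

  // Events arrive one per line; keep any incomplete line in the buffer.
  const lines = buffer.split("\n");
  buffer = lines.pop() ?? "";

  for (const line of lines) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("data:")) continue;

    const payload = trimmed.slice("data:".length).trim();
    if (payload === "[DONE]") {
      finished = true; // end event: stop reading
      break;
    }

    const chunk = JSON.parse(payload);
    const content = chunk.choices?.[0]?.delta?.content;
    if (content) fullReply += content;
  }
}

console.log(fullReply);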

Usage and billing

Token usage is metered for streaming just as for non-streaming requests: the same input and output token counts apply, and they may be reported in the final event or in usage fields. Your balance is deducted in the same way as for non-streaming requests.
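
If your provider includes usage in the final chunk, you can check for it inside the parsing loop shown above. The usage field and its property names here are assumptions; confirm them against the API reference.

// Inside the parsing loop: some APIs attach token counts to the final chunk.
// The field names (usage, prompt_tokens, completion_tokens) are assumptions.
const chunk = JSON.parse(payload);
if (chunk.usage) {
  console.log("prompt tokens:", chunk.usage.prompt_tokens);
  console.log("completion tokens:", chunk.usage.completion_tokens);
}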

Timeouts

Streaming connections can stay open much longer than a typical request/response call. Configure your client and server timeouts accordingly so long generations are not cut off prematurely.
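
For example, with fetch an AbortController can give a streaming request a longer overall deadline than you might use for ordinary calls. The five-minute figure below is only illustrative; tune it to your longest expected generation.

// Allow a longer deadline for streaming requests than for ordinary calls.
// The endpoint and the 5-minute value are illustrative assumptions.
const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), 5 * 60 * 1000);

try {
  const response = await fetch("https://api.example.com/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ /* ... streaming request body ... */ }),
    signal: controller.signal,
  });
  // ... consume the stream as shown above ...
} finally {
  clearTimeout(timeout);
}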