Streaming

You can request streaming responses so tokens are sent as they are generated instead of waiting for the full reply.

Enabling streaming

Set stream: true in the request body for chat completions:
{
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "messages": [{ "role": "user", "content": "Explain recursion." }],
  "stream": true,
  "max_tokens": 500
}
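
For example, a streaming request with fetch might look like the sketch below. The endpoint URL and API key variable are placeholders, not values from this documentation; substitute your provider's own.

// Sketch of a streaming chat completions request.
// The endpoint and auth header are assumptions; adjust them for your provider.
const response = await fetch("https://api.example.com/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.API_KEY}`,
  },
  body: JSON.stringify({
    model: "meta-llama/Llama-3.3-70B-Instruct",
    messages: [{ role: "user", content: "Explain recursion." }],
    stream: true,
    max_tokens: 500,
  }),
});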

Response format

The response is a stream of server-sent events (SSE). Each event is a line starting with data: followed by JSON.
  • Content events — Include partial content in the delta (e.g. choices[0].delta.content).
  • End event — Sent as data: [DONE] when the stream is finished.
Example (conceptual):
data: {"choices":[{"delta":{"content":"Re"}}]}
data: {"choices":[{"delta":{"content":"cursion"}}]}
...
data: [DONE]

Consuming the stream

  1. Use an HTTP client that supports streaming (e.g. fetch with response.body, or a library that handles SSE).
  2. Parse each line: strip the data: prefix and parse the JSON (or handle [DONE]).
  3. Append delta.content to build the full reply.
  4. When you see [DONE], close the stream.
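
A minimal sketch of these steps, assuming the streaming response from the request example above and the event shape from the conceptual example (field names may differ by provider):

// Read the SSE stream, parse each data: line, and accumulate delta.content.
if (!response.body) throw new Error("response has no body");
const reader = response.body.getReader();
const decoder = new TextDecoder();

let buffer = "";
let fullReply = "";
let finished = false;

while (!finished) {
  const { value, done } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });

  // Events arrive one per line; keep any incomplete line in the buffer.
  const lines = buffer.split("\n");
  buffer = lines.pop() ?? "";

  for (const line of lines) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("data:")) continue;

    const payload = trimmed.slice("data:".length).trim();
    if (payload === "[DONE]") {
      finished = true; // end event: stop reading
      break;
    }

    const chunk = JSON.parse(payload);
    const content = chunk.choices?.[0]?.delta?.content;
    if (content) fullReply += content;
  }
}

console.log(fullReply);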

Usage and billing

Token usage is metered for streaming just as for non-streaming requests: the same input and output token counts apply, and they may be reported in the final event or in usage fields. Your balance is deducted in the same way as for non-streaming requests.
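
If your provider includes usage in the final chunk, you can check for it inside the parsing loop shown above. The usage field and its property names here are assumptions; confirm them against the API reference.

// Inside the parsing loop: some APIs attach token counts to the final chunk.
// The field names (usage, prompt_tokens, completion_tokens) are assumptions.
const chunk = JSON.parse(payload);
if (chunk.usage) {
  console.log("prompt tokens:", chunk.usage.prompt_tokens);
  console.log("completion tokens:", chunk.usage.completion_tokens);
}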

Timeouts

Streaming connections can stay open much longer than a typical request/response call. Configure your client and server timeouts accordingly so long generations are not cut off prematurely.
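
For example, with fetch an AbortController can give a streaming request a longer overall deadline than you might use for ordinary calls. The five-minute figure below is only illustrative; tune it to your longest expected generation.

// Allow a longer deadline for streaming requests than for ordinary calls.
// The endpoint and the 5-minute value are illustrative assumptions.
const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), 5 * 60 * 1000);

try {
  const response = await fetch("https://api.example.com/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ /* ... streaming request body ... */ }),
    signal: controller.signal,
  });
  // ... consume the stream as shown above ...
} finally {
  clearTimeout(timeout);
}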