Best practices
Performance
Token optimization
Timeouts
Structuring prompts
Production recommendations
Security

Best practices

Performance

Reuse connections — Use HTTP keep-alive and connection pooling where possible to reduce latency.
Batch when possible — Use the batch inference endpoint for multiple prompts instead of many single requests.
Cache — Cache responses for identical or repeated prompts when correctness allows it to reduce calls and cost.

Token optimization

Shorter prompts — Trim system messages and long context when the task does not need them; you pay for input tokens.
Limit output — Set max_tokens to what you need to avoid unnecessary long completions.
Reuse context — In chat, send only the minimal message history required for the model to respond well.

Timeouts

Set client timeouts (e.g. 30–60 seconds for non-streaming) so slow responses fail fast and you can retry or surface an error.
For streaming, use longer timeouts or no timeout on the read side so the stream is not cut off mid-generation.
If the API returns a timeout (e.g. 504), retry with backoff; avoid tight retry loops.

Structuring prompts

System message — Use a clear, concise system message to set behavior; avoid huge blocks of text unless needed.
Examples — Few-shot examples in the prompt can improve quality but increase tokens; balance quality vs cost.
Format — Ask for structured output (e.g. JSON) when you need to parse the response; it can reduce retries and parsing errors.

Production recommendations

Area	Recommendation
Secrets	Store API keys in a secret manager; never in source code or client bundles.
Retries	Use exponential backoff for 429, 500, 502, 503; respect `retryAfter` when present.
Monitoring	Log request IDs, status codes, and token usage for debugging and cost tracking.
402 handling	Detect 402 and show a clear “Add funds” flow; do not retry without a balance update.
Rate limits	Design for rate limits (e.g. queue requests or throttle) so users see fewer 429s.

Security

Call the API from server-side or trusted backends only; do not expose API keys in browsers or mobile apps.
Rotate keys periodically and when team members or integrations change.
Use separate keys per environment (e.g. dev vs prod) to limit blast radius if a key is leaked.

Errors & status codes

⌘I