Skip to main content

Best practices

Performance

  • Reuse connections — Use HTTP keep-alive and connection pooling where possible to reduce latency.
  • Batch when possible — Use the batch inference endpoint for multiple prompts instead of many single requests.
  • Cache — Cache responses for identical or repeated prompts when correctness allows it to reduce calls and cost.

Token optimization

  • Shorter prompts — Trim system messages and long context when the task does not need them; you pay for input tokens.
  • Limit output — Set max_tokens to what you need to avoid unnecessary long completions.
  • Reuse context — In chat, send only the minimal message history required for the model to respond well.

Timeouts

  • Set client timeouts (e.g. 30–60 seconds for non-streaming) so slow responses fail fast and you can retry or surface an error.
  • For streaming, use longer timeouts or no timeout on the read side so the stream is not cut off mid-generation.
  • If the API returns a timeout (e.g. 504), retry with backoff; avoid tight retry loops.

Structuring prompts

  • System message — Use a clear, concise system message to set behavior; avoid huge blocks of text unless needed.
  • Examples — Few-shot examples in the prompt can improve quality but increase tokens; balance quality vs cost.
  • Format — Ask for structured output (e.g. JSON) when you need to parse the response; it can reduce retries and parsing errors.

Production recommendations

AreaRecommendation
SecretsStore API keys in a secret manager; never in source code or client bundles.
RetriesUse exponential backoff for 429, 500, 502, 503; respect retryAfter when present.
MonitoringLog request IDs, status codes, and token usage for debugging and cost tracking.
402 handlingDetect 402 and show a clear “Add funds” flow; do not retry without a balance update.
Rate limitsDesign for rate limits (e.g. queue requests or throttle) so users see fewer 429s.

Security

  • Call the API from server-side or trusted backends only; do not expose API keys in browsers or mobile apps.
  • Rotate keys periodically and when team members or integrations change.
  • Use separate keys per environment (e.g. dev vs prod) to limit blast radius if a key is leaked.