Best practices
Performance
- Reuse connections — Use HTTP keep-alive and connection pooling where possible to reduce latency.
- Batch when possible — Use the batch inference endpoint for multiple prompts instead of many single requests.
- Cache — Cache responses for identical or repeated prompts when correctness allows it to reduce calls and cost.
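
The caching advice above can be sketched as a small in-memory wrapper. This is a minimal illustration, assuming request payloads are JSON-serializable dictionaries. `PromptCache` and the `send` callable are hypothetical names standing in for whatever function your application already uses to call the API.

```python
import hashlib
import json
from typing import Any, Callable, Dict


class PromptCache:
    """In-memory cache keyed by a hash of the full request payload."""

    def __init__(self, send: Callable[[Dict[str, Any]], Any]) -> None:
        self._send = send  # hypothetical function that performs the real API call
        self._store: Dict[str, Any] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(payload: Dict[str, Any]) -> str:
        # sort_keys makes the key independent of dictionary ordering.
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def complete(self, payload: Dict[str, Any]) -> Any:
        key = self._key(payload)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss: make the real call once and remember the result.
        self.misses += 1
        response = self._send(payload)
        self._store[key] = response
        return response
```

A production cache would also need a size limit and an expiry policy, which this sketch leaves out.
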
Token optimization
- Shorter prompts — Trim system messages and long context when the task does not need them; you pay for input tokens.
- Limit output — Set `max_tokens` to what you need to avoid unnecessarily long completions.
- Reuse context — In chat, send only the minimal message history required for the model to respond well.
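
One way to apply the points above is to trim the chat history before each call and always set an output cap. The sketch below assumes messages are dictionaries with `role` and `content` keys and that the request body uses a `messages` field. Both are assumptions about the request shape, and `trim_history` and `build_request` are hypothetical helpers.

```python
from typing import Any, Dict, List

Message = Dict[str, str]


def trim_history(messages: List[Message], max_recent: int = 6) -> List[Message]:
    """Keep all system messages plus the last `max_recent` non-system messages."""
    system = [m for m in messages if m.get("role") == "system"]
    others = [m for m in messages if m.get("role") != "system"]
    # Guard against max_recent == 0, because others[-0:] would return everything.
    recent = others[-max_recent:] if max_recent > 0 else []
    return system + recent


def build_request(messages: List[Message], max_tokens: int = 256) -> Dict[str, Any]:
    """Build a request body with trimmed history and an explicit output cap."""
    # "messages" is an assumed field name; max_tokens is the parameter named above.
    return {"messages": trim_history(messages), "max_tokens": max_tokens}
```

Counting messages is a rough proxy for tokens. A token-based budget would be more precise but depends on the tokenizer in use.
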
Timeouts
- Set client timeouts (e.g. 30–60 seconds for non-streaming) so slow responses fail fast and you can retry or surface an error.
- For streaming, use longer timeouts or no timeout on the read side so the stream is not cut off mid-generation.
- If the API returns a timeout (e.g. 504), retry with backoff; avoid tight retry loops.
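
The retry guidance above can be sketched as a wrapper that backs off between attempts. It assumes `request` is a hypothetical zero-argument callable that performs the HTTP call with a client timeout already set and raises `TimeoutError` when that timeout fires or when the API answers with a timeout status. How a timeout actually surfaces depends on your HTTP client.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    request: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `request`, retrying on TimeoutError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return request()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Upper bound grows 1s, 2s, 4s, ... and is capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Sleep a random time up to the bound so clients do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
    raise RuntimeError("max_attempts must be at least 1")
```

The `sleep` parameter exists only so the wait can be replaced in tests. In normal use the default `time.sleep` applies.
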
Structuring prompts
- System message — Use a clear, concise system message to set behavior; avoid huge blocks of text unless needed.
- Examples — Few-shot examples in the prompt can improve quality but increase tokens; balance quality vs cost.
- Format — Ask for structured output (e.g. JSON) when you need to parse the response; it can reduce retries and parsing errors.
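
When you ask for JSON, it helps to validate the reply before using it. The sketch below shows one possible check, assuming the prompt asked for a single JSON object with known keys. `parse_json_reply` is a hypothetical helper, and the key names in the usage note are illustrative only.

```python
import json
from typing import Any, Dict, Iterable

FENCE = "`" * 3  # Markdown code-fence marker that some replies wrap around JSON


def parse_json_reply(text: str, required_keys: Iterable[str]) -> Dict[str, Any]:
    """Parse a reply expected to be a JSON object; raise ValueError otherwise."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        lines = cleaned.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence line
        cleaned = "\n".join(lines)
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply is JSON but not an object")
    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"reply is missing keys: {missing}")
    return data
```

For example, `parse_json_reply(reply_text, ["label", "score"])` either returns a dictionary containing both keys or raises `ValueError`, which the caller can treat as a signal to retry or fall back.
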
Production recommendations
| Area | Recommendation |
|---|---|
| Secrets | Store API keys in a secret manager; never in source code or client bundles. |
| Retries | Use exponential backoff for 429, 500, 502, 503; respect retryAfter when present. |
| Monitoring | Log request IDs, status codes, and token usage for debugging and cost tracking. |
| 402 handling | Detect 402 and show a clear “Add funds” flow; do not retry without a balance update. |
| Rate limits | Design for rate limits (e.g. queue requests or throttle) so users see fewer 429s. |
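
The retry and 402 rows in the table can be expressed as a single decision function. This is a sketch under stated assumptions: `retry_delay` is a hypothetical helper, and `retry_after` is the server's retry hint already parsed into seconds. How that hint is delivered (header or body field) is not specified here.

```python
from typing import Optional

# 429/500/502/503 come from the table; 504 follows the Timeouts guidance.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


def retry_delay(
    status: int,
    attempt: int,
    retry_after: Optional[float] = None,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
) -> Optional[float]:
    """Return seconds to wait before retrying, or None if a retry is pointless.

    `attempt` is zero-based: 0 for the first retry.
    """
    if status == 402:
        return None  # needs a balance update, not a retry
    if status not in RETRYABLE_STATUSES:
        return None  # other statuses are not retried in this sketch
    if retry_after is not None:
        return max(0.0, retry_after)  # the server's hint takes precedence
    # Exponential backoff: 1s, 2s, 4s, ... capped at max_delay.
    return min(max_delay, base_delay * (2 ** attempt))
```

A caller would sleep for the returned number of seconds and try again, or stop and surface the error (for 402, the "Add funds" flow) when the result is `None`. Adding random jitter to the computed delay is a common refinement.
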
Security
- Call the API from server-side or trusted backends only; do not expose API keys in browsers or mobile apps.
- Rotate keys periodically and when team members or integrations change.
- Use separate keys per environment (e.g. dev vs prod) to limit blast radius if a key is leaked.
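
As an illustration of keeping keys out of source code and separate per environment, the sketch below reads the key from an environment variable at startup. It assumes your secret manager or deployment platform injects the key that way. The `EXAMPLE_API_KEY_<ENV>` naming scheme and `load_api_key` are hypothetical.

```python
import os
from typing import Mapping, Optional


def load_api_key(environment: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Read the API key for `environment` from a per-environment variable."""
    source = os.environ if env is None else env
    # Hypothetical naming scheme: EXAMPLE_API_KEY_DEV, EXAMPLE_API_KEY_PROD, ...
    var_name = f"EXAMPLE_API_KEY_{environment.upper()}"
    key = source.get(var_name, "").strip()
    if not key:
        # Fail at startup instead of sending unauthenticated requests later.
        raise RuntimeError(f"{var_name} is not set; refusing to start without a key")
    return key
```

Because each environment reads a different variable, rotating or revoking the dev key does not touch prod.
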