
Rate Limits

Rate limits are applied per client (by IP and, where relevant, by path) to protect the API and ensure fair usage. When a limit is exceeded, the API returns 429 Too Many Requests.

Limits by Endpoint

Endpoint / area        | Window     | Max requests | Notes
-----------------------|------------|--------------|------------------------------------------
Inference              | 1 minute   | 100          | /models/{id}/inference, etc.
Chat completions       | 5 minutes  | 20           | OpenAI-compatible /chat/completions
Batch inference        | 1 minute   | 20           | Each batch can contain up to 100 items
Training start         | 1 minute   | 10           | Starting new fine-tune jobs
Auth (login, signup)   | 15 minutes | 10           | Authentication attempts
General API            | 1 minute   | 60           | Other API routes
Upload                 | 1 minute   | 10           | File uploads
Explorer               | 1 minute   | 30           | Explorer endpoints
Contact form           | 1 hour     | 5            | Contact / support

Response When Rate Limited

When you exceed a limit, the API responds with:
  • Status: 429 Too Many Requests
  • Body: JSON with error, optional message, and retryAfter (seconds)
  • Headers:
    • Retry-After – Seconds to wait before retrying
    • X-RateLimit-Limit – Max requests in the window
    • X-RateLimit-Remaining – 0 when limited
    • X-RateLimit-Reset – Unix timestamp when the window resets
Example:
{
  "error": "Too many requests. Please slow down.",
  "message": "Too many requests. Currently, there are many requests being processed. Please try again later.",
  "retryAfter": 60
}
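
A minimal sketch of handling this response in TypeScript, assuming Node 18+ (global fetch) and a placeholder base URL; the endpoint path, header names, and body shape follow the table and example above, everything else is illustrative:

async function callInference(modelId: string, payload: unknown) {
  // Base URL is a placeholder; substitute your actual API host.
  const res = await fetch(`https://api.example.com/models/${modelId}/inference`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });

  if (res.status === 429) {
    // Body shape follows the example above: error, optional message, retryAfter (seconds).
    const body = await res.json();
    const retryAfter = Number(res.headers.get("Retry-After") ?? body.retryAfter ?? 60);
    const resetAt = res.headers.get("X-RateLimit-Reset"); // Unix timestamp of window reset
    throw new Error(`Rate limited (${body.error}); retry in ${retryAfter}s, window resets at ${resetAt}`);
  }

  return res.json();
}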

Best Practices

  1. Honor Retry-After – Wait at least that many seconds before retrying.
  2. Use exponential backoff – After repeated 429s, increase the delay between retries (see the sketch after this list).
  3. Batch when possible – Use the batch inference endpoint instead of many single requests.
  4. Cache – Cache responses where appropriate to reduce request volume.
  5. Monitor – Track 429 responses in your application to tune concurrency and batching.
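
A rough sketch of a retry helper that honors Retry-After and falls back to exponential backoff; the attempt cap and starting delay are illustrative assumptions, not API requirements:

async function withRetry(
  doRequest: () => Promise<Response>,
  maxAttempts = 5,        // illustrative cap, not an API requirement
): Promise<Response> {
  let backoffMs = 1_000;  // illustrative starting delay

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const res = await doRequest();
    if (res.status !== 429) return res;

    // Honor Retry-After when present; otherwise fall back to the current backoff.
    const retryAfter = Number(res.headers.get("Retry-After"));
    const waitMs = retryAfter > 0 ? retryAfter * 1000 : backoffMs;

    await new Promise((resolve) => setTimeout(resolve, waitMs));
    backoffMs *= 2; // double the delay for the next attempt
  }

  throw new Error("Rate limited: retries exhausted");
}

Wrapping any request, e.g. withRetry(() => fetch(url, options)), applies the same behavior to every endpoint.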

Next Steps