# Rate Limits
Rate limits are applied per client (by IP and, where relevant, by path) to protect the API and ensure fair usage. When a limit is exceeded, the API returns `429 Too Many Requests`.
## Limits by Endpoint
| Endpoint / area | Window | Max requests | Notes |
|---|---|---|---|
| Inference | 1 minute | 100 | /models/{id}/inference, etc. |
| Chat completions | 5 minutes | 20 | OpenAI-compatible /chat/completions |
| Batch inference | 1 minute | 20 | Each batch can contain up to 100 items |
| Training start | 1 minute | 10 | Starting new fine-tune jobs |
| Auth (login, signup) | 15 minutes | 10 | Authentication attempts |
| General API | 1 minute | 60 | Other API routes |
| Upload | 1 minute | 10 | File uploads |
| Explorer | 1 minute | 30 | Explorer endpoints |
| Contact form | 1 hour | 5 | Contact / support |
## Response When Rate Limited
When you exceed a limit, the API responds with:
- Status: `429 Too Many Requests`
- Body: JSON with `error`, an optional `message`, and `retryAfter` (seconds)
- Headers:
  - `Retry-After` – Seconds to wait before retrying
  - `X-RateLimit-Limit` – Max requests in the window
  - `X-RateLimit-Remaining` – `0` when limited
  - `X-RateLimit-Reset` – Unix timestamp when the window resets
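A client can combine these signals when deciding how long to pause. A minimal sketch in Python, assuming the response has been reduced to a status code, a headers dict, and the raw body text (the helper name `seconds_to_wait` is illustrative, not part of the API):

```python
import json

def seconds_to_wait(status, headers, body_text):
    """Return how long to pause before retrying, or 0 if not rate limited.

    Prefers the Retry-After header; falls back to the retryAfter
    field in the JSON error body.
    """
    if status != 429:
        return 0
    if "Retry-After" in headers:
        return int(headers["Retry-After"])
    try:
        body = json.loads(body_text)
        return int(body.get("retryAfter", 1))
    except ValueError:
        return 1  # no hint given; fall back to a short conservative wait

# Example: a rate-limited response carrying both hints
wait = seconds_to_wait(
    429,
    {"Retry-After": "30", "X-RateLimit-Remaining": "0"},
    '{"error": "rate_limited", "retryAfter": 30}',
)
print(wait)  # → 30
```

`X-RateLimit-Reset` could serve the same purpose, but it requires comparing against the local clock, so `Retry-After` is the simpler signal to honor.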
## Best Practices
- Honor `Retry-After` – Wait at least that many seconds before retrying.
- Use exponential backoff – After repeated 429s, increase the delay between retries.
- Batch when possible – Use the batch inference endpoint instead of many single requests.
- Cache – Cache responses where appropriate to reduce request volume.
- Monitor – Track 429 responses in your application to tune concurrency and batching.
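The first two practices can be sketched as a single retry loop. This is an illustrative Python sketch, not part of the API: `send_request` is a hypothetical stand-in for your HTTP call, returning a status code, headers, and body.

```python
import time

def request_with_backoff(send_request, max_retries=5, base_delay=1.0):
    """Retry on 429, honoring Retry-After and doubling the delay each attempt."""
    delay = base_delay
    for _ in range(max_retries):
        status, headers, body = send_request()
        if status != 429:
            return status, headers, body
        # Wait at least Retry-After seconds, else fall back to the backoff delay
        wait = max(float(headers.get("Retry-After", 0)), delay)
        time.sleep(wait)
        delay *= 2
    return status, headers, body  # give up after max_retries attempts

# Example with a fake sender: rate-limited twice, then successful
attempts = iter([
    (429, {"Retry-After": "0"}, ""),
    (429, {}, ""),
    (200, {}, '{"ok": true}'),
])
status, _, body = request_with_backoff(lambda: next(attempts), base_delay=0.01)
print(status)  # → 200
```

Adding random jitter to the delay is a common refinement so that many clients do not retry in lockstep after the same window resets.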
## Next Steps
- Error codes for other API error responses
- API Overview for authentication and base URLs