Inference overview
The platform exposes OpenAI-compatible endpoints for text generation and chat. You send a prompt or a list of messages, choose a model, and receive generated text plus token usage.
Endpoints
| Purpose | Method | Path |
|---|---|---|
| Standard completion | POST | /v1/models/{modelId}/inference |
| Chat completions | POST | /v1/chat/completions |
| Batch inference | POST | /v1/batch/inference |
All endpoint paths are relative to the base URL https://api.llmtune.io.
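For example, a standard completion request targets the model-specific path under the base URL. A minimal sketch in Python using requests, assuming a bearer-token Authorization header (see Authentication below) and a placeholder model ID:

```python
import requests

BASE_URL = "https://api.llmtune.io"
MODEL_ID = "my-model"  # placeholder model ID

# Standard completion: POST /v1/models/{modelId}/inference
resp = requests.post(
    f"{BASE_URL}/v1/models/{MODEL_ID}/inference",
    headers={"Authorization": "Bearer <API_KEY>"},  # assumed auth scheme; see Authentication
    json={"prompt": "Write a haiku about the sea.", "max_tokens": 64},
)
resp.raise_for_status()
print(resp.json())
```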
Authentication
Every request must include valid API credentials. Requests with missing or invalid credentials return 401 Unauthorized.
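A hedged sketch of an authenticated call and the 401 check, assuming a bearer-token Authorization header (the exact credential format may differ; check your API key settings):

```python
import requests

headers = {
    "Authorization": "Bearer <API_KEY>",  # assumed bearer-token scheme
    "Content-Type": "application/json",
}
resp = requests.post(
    "https://api.llmtune.io/v1/chat/completions",
    headers=headers,
    json={"model": "my-model", "messages": [{"role": "user", "content": "Hello"}]},
)
if resp.status_code == 401:
    # Missing or invalid credentials
    raise RuntimeError("401 Unauthorized: check that the API key is present and valid")
```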
Common parameters
| Parameter | Type | Description |
|---|---|---|
| prompt | string | Input text (completion endpoint). |
| messages | array | Chat messages (chat completions). |
| temperature | number | Sampling randomness (e.g. 0.7). |
| max_tokens / maxTokens | number | Maximum output tokens. |
| top_p / topP | number | Nucleus sampling (e.g. 1.0). |
| stream | boolean | Enable streaming (when supported). |
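Putting the common parameters together, a sketch of a chat completions request body (parameter names follow the table above; whether the snake_case or camelCase variant applies depends on the endpoint):

```python
payload = {
    "model": "my-model",  # placeholder model ID
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize the plot of Hamlet in two sentences."},
    ],
    "temperature": 0.7,  # sampling randomness
    "max_tokens": 128,   # maximum output tokens (maxTokens on some endpoints)
    "top_p": 1.0,        # nucleus sampling (topP on some endpoints)
    "stream": False,     # set True to enable streaming where supported
}
```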
Response shape
A typical success response includes:
- text or choices — Generated content.
- tokens or usage — Token counts (input/output).
- model — Model ID used.
- latency — Response time in milliseconds (when provided).
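A sketch of reading these fields from a parsed response, allowing for either naming variant (the function and exact field layout here are illustrative, not the definitive schema):

```python
def read_response(data: dict) -> None:
    """Print the main fields of an inference response (field names are illustrative)."""
    # Generated content: a flat "text" field or an OpenAI-style "choices" list
    text = data.get("text") or data["choices"][0]["message"]["content"]
    # Token counts: "tokens" or OpenAI-style "usage"
    usage = data.get("usage") or data.get("tokens")
    print("model:", data.get("model"))
    print("latency (ms):", data.get("latency"))  # only when provided
    print("output:", text)
    print("usage:", usage)
```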
Usage and billing
Each request consumes tokens from your account balance. Input and output tokens are metered and charged; if your balance is insufficient, the API returns 402 Payment Required. See Billing & usage.
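A small sketch of surfacing the balance check around a request (the status code is as described above; the error handling itself is illustrative):

```python
import requests

resp = requests.post(
    "https://api.llmtune.io/v1/chat/completions",
    headers={"Authorization": "Bearer <API_KEY>"},  # assumed bearer-token scheme
    json={"model": "my-model", "messages": [{"role": "user", "content": "Hi"}]},
)
if resp.status_code == 402:
    # Payment Required: balance cannot cover the metered input/output tokens
    raise RuntimeError("Insufficient balance; see Billing & usage")
resp.raise_for_status()
print("usage:", resp.json().get("usage"))
```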
Next
- Chat completions — Request/response format and examples.
- Streaming — Streaming responses.
- Errors and limits — Rate limits and error handling.