Chat completions

The chat completions endpoint accepts a list of messages and returns the model’s reply. The request and response shapes are compatible with the OpenAI Chat Completions API.

Endpoint

POST https://api.llmtune.io/v1/chat/completions

Request body

Field        Required  Type     Description
model        Yes       string   Model ID (e.g. from the models list).
messages     Yes       array    Array of { role, content } objects.
temperature  No        number   Sampling temperature. Default 0.7.
max_tokens   No        number   Maximum number of output tokens.
stream       No        boolean  Set to true to stream the response.

Message roles

  • system — System instruction (optional).
  • user — User message.
  • assistant — A previous assistant reply, sent back to carry multi-turn context (see the example below).
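
For multi-turn conversations, include earlier assistant replies in the messages array so the model sees the full history. A minimal illustrative sketch in Python (the conversation content is made up):

messages = [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What is the capital of France?" },
    # The previous model reply is sent back as an assistant message.
    { "role": "assistant", "content": "The capital of France is Paris." },
    { "role": "user", "content": "And roughly how many people live there?" }
]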

Example request

{
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Summarize this in one sentence." }
  ],
  "temperature": 0.7,
  "max_tokens": 400
}

Example: cURL

curl https://api.llmtune.io/v1/chat/completions \
  -H "Authorization: Bearer sk_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ],
    "temperature": 0.5,
    "max_tokens": 200
  }'
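
Example: Python

A minimal sketch using the requests library; it mirrors the cURL call above and is not an official SDK. Replace the key placeholder with your own API key.

import requests

API_KEY = "sk_live_YOUR_KEY"  # placeholder

resp = requests.post(
    "https://api.llmtune.io/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "temperature": 0.5,
        "max_tokens": 200,
    },
    timeout=60,
)
resp.raise_for_status()
data = resp.json()

# OpenAI-style response: assistant text and token usage.
print(data["choices"][0]["message"]["content"])
print(data["usage"]["total_tokens"])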

Example response (non-streaming)

Responses follow an OpenAI-style structure, for example:
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 8,
    "total_tokens": 20
  }
}
Read the assistant text from choices[0].message.content. Token usage is reported in usage; these tokens are metered and deducted from your balance.
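
Example: streaming

When stream is true, the reply arrives incrementally instead of as a single JSON object. The sketch below assumes OpenAI-style server-sent events (data: lines carrying chunks with a choices[0].delta.content field and a final data: [DONE] line); verify the exact chunk format against your own responses.

import json
import requests

resp = requests.post(
    "https://api.llmtune.io/v1/chat/completions",
    headers={"Authorization": "Bearer sk_live_YOUR_KEY"},
    json={
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [{"role": "user", "content": "Tell me a short story."}],
        "stream": True,
    },
    stream=True,  # read the body as it arrives instead of buffering it
    timeout=60,
)
resp.raise_for_status()

for raw in resp.iter_lines():
    if not raw:
        continue
    line = raw.decode("utf-8")
    if not line.startswith("data: "):
        continue
    payload = line[len("data: "):]
    if payload == "[DONE]":
        break
    chunk = json.loads(payload)
    delta = chunk["choices"][0]["delta"].get("content")
    if delta:
        print(delta, end="", flush=True)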

Standard completion (single prompt)

For a single prompt without chat history, use the model inference endpoint:
POST https://api.llmtune.io/v1/models/{modelId}/inference
Body:
{
  "prompt": "Your prompt here.",
  "temperature": 0.6,
  "maxTokens": 256
}
Response typically includes text, tokens, model, and optionally latency.
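
A minimal Python sketch for this endpoint, again using the requests library; the model ID is illustrative and the response handling assumes the fields described above:

import requests

model_id = "meta-llama/Llama-3.3-70B-Instruct"  # illustrative model ID

resp = requests.post(
    f"https://api.llmtune.io/v1/models/{model_id}/inference",
    headers={"Authorization": "Bearer sk_live_YOUR_KEY"},
    json={
        "prompt": "Write a haiku about the sea.",
        "temperature": 0.6,
        "maxTokens": 256,
    },
    timeout=60,
)
resp.raise_for_status()
result = resp.json()
print(result["text"])  # tokens, model, and latency may also be present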