Inference API Guide

LLMTune provides an OpenAI-compatible chat completion API for deployed models. This section shows how to call it with cURL, JavaScript, and Python.

Endpoint

POST https://api.llmtune.io/v1/models/{modelId}/inference

For chat completions (used throughout this guide), use:

POST https://api.llmtune.io/v1/chat/completions

Replace {modelId} with a supported model ID from the catalog.

Authentication

Include a Bearer token created under API Keys.
Authorization: Bearer YOUR_API_KEY

Request Body

{
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "messages": [
    { "role": "system", "content": "You are a polite assistant." },
    { "role": "user", "content": "Summarize this support ticket." }
  ],
  "temperature": 0.7,
  "max_tokens": 400,
  "stream": false
}

Available Parameters

Field        Required  Description
model        Yes       Model ID from the catalog
messages     Yes       Array of chat messages
temperature  No        Randomness control (default 0.7)
max_tokens   No        Maximum output tokens (default 512)
stream       No        Enable SSE streaming
metadata     No        Arbitrary JSON metadata for observability
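The optional fields above can all be combined in one request body. A minimal sketch in Python (the metadata keys ticket_id and team are illustrative only, since metadata accepts arbitrary JSON):

```python
# Example payload exercising the optional fields from the table above.
payload = {
    "model": "meta-llama/Llama-3.3-70B-Instruct",  # required
    "messages": [                                   # required
        {"role": "user", "content": "Summarize this support ticket."}
    ],
    "temperature": 0.2,   # optional; defaults to 0.7
    "max_tokens": 256,    # optional; defaults to 512
    "stream": False,      # optional; set True for SSE streaming
    # optional, arbitrary JSON for observability (example keys are made up)
    "metadata": {"ticket_id": "T-1042", "team": "support"},
}
```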

Responses

{
  "id": "chatcmpl-123",
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "usage": { "prompt_tokens": 200, "completion_tokens": 150, "total_tokens": 350 },
  "choices": [
    {
      "message": { "role": "assistant", "content": "Here is the summary..." },
      "finish_reason": "stop",
      "index": 0
    }
  ]
}
Errors use standard HTTP status codes, and the error response body includes error and message fields.
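Since error bodies carry error and message fields, clients can surface them instead of a bare status code. A minimal sketch in Python (format_api_error and safe_chat are illustrative helper names, not part of any SDK, and exact error field names may vary by deployment):

```python
import requests


def format_api_error(status_code: int, body: dict) -> str:
    """Build a readable message from an error body with "error"/"message" fields."""
    return f'HTTP {status_code}: {body.get("error", "unknown")} - {body.get("message", "")}'


def safe_chat(api_key: str, payload: dict) -> dict:
    """POST a chat completion and raise with API error details on failure."""
    response = requests.post(
        "https://api.llmtune.io/v1/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json=payload,
        timeout=30,
    )
    if not response.ok:
        try:
            detail = format_api_error(response.status_code, response.json())
        except ValueError:  # body was not JSON
            detail = f"HTTP {response.status_code}: {response.text}"
        raise RuntimeError(detail)
    return response.json()
```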

cURL Example

curl https://api.llmtune.io/v1/chat/completions \
  -H "Authorization: Bearer $LLMTUNE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Draft a product update email." }
    ],
    "temperature": 0.5,
    "max_tokens": 300
  }'

JavaScript (Fetch)

const response = await fetch('https://api.llmtune.io/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.LLMTUNE_API_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'meta-llama/Llama-3.3-70B-Instruct',
    messages: [
      { role: 'system', content: 'You are a helpful assistant.' },
      { role: 'user', content: 'Draft a product update email.' }
    ],
    temperature: 0.5,
    max_tokens: 300
  })
});

const data = await response.json();
console.log(data.choices[0].message.content);

Python (requests)

import os
import requests

model_id = "meta-llama/Llama-3.3-70B-Instruct"
api_key = os.environ["LLMTUNE_API_KEY"]

payload = {
    "model": model_id,
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Draft a product update email."}
    ],
    "temperature": 0.5,
    "max_tokens": 300
}

response = requests.post(
    "https://api.llmtune.io/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    },
    json=payload,
    timeout=30
)

response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])

Streaming

To enable streaming, set "stream": true in the request body and consume the server-sent events (SSE) stream on the client side. Each event carries a partial token output until data: [DONE] is emitted.
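A minimal streaming consumer in Python, continuing the requests example above. This assumes OpenAI-style streaming chunks where each event carries choices[0].delta.content; verify the actual chunk schema for your deployment (parse_sse_data and stream_chat are illustrative helper names):

```python
import json

import requests


def parse_sse_data(line: str):
    """Return the payload of a "data: ..." SSE line, or None for any other line."""
    prefix = "data: "
    if line.startswith(prefix):
        return line[len(prefix):]
    return None


def stream_chat(api_key: str, payload: dict):
    """Yield text deltas from the chat completions SSE stream."""
    with requests.post(
        "https://api.llmtune.io/v1/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json={**payload, "stream": True},
        stream=True,
        timeout=60,
    ) as response:
        response.raise_for_status()
        for line in response.iter_lines(decode_unicode=True):
            data = parse_sse_data(line or "")
            if data is None:
                continue  # skip comments, keep-alives, and blank separators
            if data == "[DONE]":
                break     # end-of-stream sentinel
            chunk = json.loads(data)
            delta = chunk["choices"][0].get("delta", {})
            if delta.get("content"):
                yield delta["content"]
```

Usage is then simply: for token in stream_chat(api_key, payload): print(token, end="", flush=True).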

Next steps