Inference API Guide

LLMTune provides an OpenAI-compatible chat completion API for deployed models. This section shows how to call it with cURL, JavaScript, and Python.

Endpoint

POST https://llmtune.io/api/models/{deployment_id}/inference
Replace {deployment_id} with your deployed model ID.

Authentication

Include a Bearer token created under API Keys in the Authorization header:
Authorization: Bearer YOUR_API_KEY

Request Body

{
  "messages": [
    { "role": "system", "content": "You are a polite assistant." },
    { "role": "user", "content": "Summarize this support ticket." }
  ],
  "temperature": 0.7,
  "max_tokens": 400,
  "stream": false
}

Available Parameters

Field         Required   Description
messages      Yes        Array of chat messages
temperature   No         Randomness control (default 0.7)
max_tokens    No         Maximum output tokens (default 512)
stream        No         Enable SSE streaming
metadata      No         Arbitrary JSON metadata for observability
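
As a sketch, a request body that exercises the optional parameters might look like the following in Python. The metadata keys shown are hypothetical; any JSON object is accepted there.

# Example payload using the optional parameters above.
payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Classify this support ticket."}
    ],
    "temperature": 0.2,   # lower values give more deterministic output
    "max_tokens": 256,    # cap on generated tokens (default 512)
    "stream": False,      # set to True to receive an SSE stream
    "metadata": {"request_source": "docs-example"}  # hypothetical keys, for observability
}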

Responses

{
  "id": "chatcmpl-123",
  "model": "workspace/model-v1",
  "usage": { "prompt_tokens": 200, "completion_tokens": 150, "total_tokens": 350 },
  "choices": [
    {
      "message": { "role": "assistant", "content": "Here is the summary..." },
      "finish_reason": "stop",
      "index": 0
    }
  ]
}
Errors use standard HTTP status codes; the response body includes error and message fields.
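
A minimal Python sketch for surfacing these errors is shown below. The call_inference helper is illustrative, not part of any SDK, and it assumes only the error and message fields described above.

import requests

def call_inference(url: str, api_key: str, payload: dict) -> dict:
    """POST to the inference endpoint and raise a readable error on failure."""
    response = requests.post(
        url,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        },
        json=payload,
        timeout=30,
    )
    if not response.ok:
        # Error bodies carry "error" and "message" fields alongside the HTTP status.
        body = response.json()
        raise RuntimeError(
            f"HTTP {response.status_code}: {body.get('error')}: {body.get('message')}"
        )
    return response.json()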

cURL Example

curl https://llmtune.io/api/models/{deployment_id}/inference \
  -H "Authorization: Bearer $LLMTUNE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Draft a product update email." }
    ],
    "temperature": 0.5,
    "max_tokens": 300
  }'

JavaScript (Fetch)

const response = await fetch(`https://llmtune.io/api/models/${deploymentId}/inference`, {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.LLMTUNE_API_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    messages: [
      { role: 'system', content: 'You are a helpful assistant.' },
      { role: 'user', content: 'Draft a product update email.' }
    ],
    temperature: 0.5,
    max_tokens: 300
  })
});

const data = await response.json();
console.log(data.choices[0].message.content);

Python (requests)

import os
import requests

deployment_id = "workspace-model-v1"
api_key = os.environ["LLMTUNE_API_KEY"]

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Draft a product update email."}
    ],
    "temperature": 0.5,
    "max_tokens": 300
}

response = requests.post(
    f"https://llmtune.io/api/models/{deployment_id}/inference",
    headers={
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    },
    json=payload,
    timeout=30
)

response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])

Streaming

To enable streaming responses, set "stream": true in the request and consume the SSE stream on the client side. Each event will carry a partial token output until data: [DONE] is emitted.
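
A minimal Python sketch for consuming the stream follows. It assumes the events use an OpenAI-style delta format (choices[0].delta.content); adjust the parsing if your deployment emits a different chunk shape.

import json
import os
import requests

deployment_id = "workspace-model-v1"
api_key = os.environ["LLMTUNE_API_KEY"]

response = requests.post(
    f"https://llmtune.io/api/models/{deployment_id}/inference",
    headers={
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    },
    json={
        "messages": [{"role": "user", "content": "Draft a product update email."}],
        "stream": True
    },
    stream=True,   # keep the connection open and read the body incrementally
    timeout=60
)
response.raise_for_status()

for line in response.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data: "):
        continue                      # skip keep-alives and blank lines
    data = line[len("data: "):]
    if data == "[DONE]":
        break                         # end of stream
    chunk = json.loads(data)
    # Assumption: chunks carry incremental text under choices[0].delta.content
    delta = chunk["choices"][0].get("delta", {}).get("content", "")
    print(delta, end="", flush=True)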