Inference API Guide
LLMTune provides an OpenAI-compatible chat completion API for deployed models. This section shows how to call it with cURL, JavaScript, and Python.
Endpoints
POST https://api.llmtune.io/v1/models/{modelId}/inference
For chat completions, which this guide uses throughout, use:
POST https://api.llmtune.io/v1/chat/completions
Replace {modelId} with a supported model ID from the catalog; the chat completions endpoint instead takes the model ID in the request body's model field.
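For example, with the catalog model used in the examples below, the model-scoped endpoint would resolve to the following (depending on your HTTP client, the / in the model ID may need URL-encoding):

POST https://api.llmtune.io/v1/models/meta-llama/Llama-3.3-70B-Instruct/inference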
Authentication
Include a Bearer token created under API Keys in the Authorization header of every request:
Authorization: Bearer YOUR_API_KEY
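The examples in this guide read the key from the LLMTUNE_API_KEY environment variable, which you can set in your shell before running them:

export LLMTUNE_API_KEY="YOUR_API_KEY"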
Request Body
{
"model": "meta-llama/Llama-3.3-70B-Instruct",
"messages": [
{ "role": "system", "content": "You are a polite assistant." },
{ "role": "user", "content": "Summarize this support ticket." }
],
"temperature": 0.7,
"max_tokens": 400,
"stream": false
}
Available Parameters
| Field | Required | Description |
|---|---|---|
| model | Yes | Model ID from the catalog |
| messages | Yes | Array of chat messages |
| temperature | No | Randomness control (default 0.7) |
| max_tokens | No | Maximum output tokens (default 512) |
| stream | No | Enable SSE streaming |
| metadata | No | Arbitrary JSON metadata for observability |
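Since metadata accepts arbitrary JSON, you can tag requests for later filtering in your observability tooling. For example (the keys shown here are illustrative, not a required schema):

{
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "messages": [ ... ],
  "metadata": { "source": "support-dashboard", "ticket_id": "T-1042" }
}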
Responses
{
  "id": "chatcmpl-123",
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "usage": { "prompt_tokens": 200, "completion_tokens": 150, "total_tokens": 350 },
  "choices": [
    {
      "message": { "role": "assistant", "content": "Here is the summary..." },
      "finish_reason": "stop",
      "index": 0
    }
  ]
}
Errors use standard HTTP status codes; the response body includes error and message fields describing the failure.
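For instance, an invalid or missing API key might return a 401 with a body along these lines (the exact values are illustrative):

{
  "error": "unauthorized",
  "message": "Invalid or missing API key."
}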
cURL Example
curl https://api.llmtune.io/v1/chat/completions \
-H "Authorization: Bearer $LLMTUNE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.3-70B-Instruct",
"messages": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "Draft a product update email." }
],
"temperature": 0.5,
"max_tokens": 300
}'
JavaScript (Fetch)
const response = await fetch('https://api.llmtune.io/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.LLMTUNE_API_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'meta-llama/Llama-3.3-70B-Instruct',
    messages: [
      { role: 'system', content: 'You are a helpful assistant.' },
      { role: 'user', content: 'Draft a product update email.' }
    ],
    temperature: 0.5,
    max_tokens: 300
  })
});

if (!response.ok) {
  // Surface API errors here rather than failing later on a malformed body.
  throw new Error(`Request failed: ${response.status} ${await response.text()}`);
}

const data = await response.json();
console.log(data.choices[0].message.content);
Python (requests)
import os
import requests
model_id = "meta-llama/Llama-3.3-70B-Instruct"
api_key = os.environ["LLMTUNE_API_KEY"]
payload = {
"model": model_id,
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Draft a product update email."}
],
"temperature": 0.5,
"max_tokens": 300
}
response = requests.post(
"https://api.llmtune.io/v1/chat/completions",
headers={
"Authorization": f"Bearer {api_key}",
"Content-Type": "application/json"
},
json=payload,
timeout=30
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
Streaming
To enable streaming, set "stream": true in the request and consume the Server-Sent Events (SSE) stream on the client side. Each event carries a partial chunk of the output until data: [DONE] is emitted.
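Since the API is OpenAI-compatible, a minimal Python sketch for consuming the stream might look like the following. It assumes each SSE data: line carries a JSON chunk with an OpenAI-style choices[0].delta.content field; verify this against the chunks your deployment actually emits.

import json
import os
import requests

payload = {
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "messages": [{"role": "user", "content": "Draft a product update email."}],
    "stream": True,
}

# stream=True tells requests not to buffer the whole response body.
with requests.post(
    "https://api.llmtune.io/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['LLMTUNE_API_KEY']}"},
    json=payload,
    stream=True,
    timeout=30,
) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        if not line:
            continue  # skip SSE keep-alive blank lines
        decoded = line.decode("utf-8")
        if not decoded.startswith("data: "):
            continue
        data = decoded[len("data: "):]
        if data == "[DONE]":
            break  # server signals end of stream
        chunk = json.loads(data)
        # Assumed OpenAI-style delta shape; adjust if the real chunks differ.
        delta = chunk["choices"][0].get("delta", {})
        print(delta.get("content", ""), end="", flush=True)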
Next steps