Inference API Guide

LLMTune provides an HTTP inference API for deployed models. This guide shows how to call it with cURL, JavaScript, and Python.

Base URL

The public LLMTune API is available at:
https://api.llmtune.io/v1
Note: The in-app routes under https://llmtune.io/api/... are used by the LLMTune web application. For external integrations, use the https://api.llmtune.io/v1 base URL.

Endpoint

POST /models/{modelId}/inference
Replace {modelId} with your deployed model ID (e.g., meta-llama/Llama-3.3-70B-Instruct or your fine-tuned model ID).

Authentication

Authenticate each request with a Bearer token created under API Keys in the LLMTune dashboard:
Authorization: Bearer YOUR_API_KEY

Request Body

{
  "prompt": "Summarize LLMTune in one sentence.",
  "temperature": 0.7,
  "maxTokens": 200,
  "topP": 1.0,
  "topK": 50,
  "metadata": { "conversationId": "optional" }
}

Available Parameters

Field        Required  Description                                          Default
prompt       Yes       Input prompt string                                  -
temperature  No        Sampling temperature (0-2); higher = more creative   0.7
maxTokens    No        Maximum output tokens                                1024
topP         No        Nucleus sampling parameter                           1.0
topK         No        Top-K sampling limit                                 50
metadata     No        Arbitrary JSON metadata for observability            null

Response Format

{
  "text": "Generated response...",
  "tokens": 228,
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "latency": 234,
  "metadata": { "conversationId": "optional" }
}
Errors use standard HTTP status codes; the response body contains an error object with message and code fields:
{
  "error": {
    "message": "Invalid API key",
    "code": "UNAUTHORIZED"
  }
}
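
For reference, a small Python helper that surfaces this error envelope (the post_inference name is illustrative; the field names follow the shape shown above):

import requests

def post_inference(url: str, api_key: str, payload: dict) -> dict:
    """POST to an inference endpoint; raise a readable error on failure."""
    response = requests.post(
        url,
        headers={"Authorization": f"Bearer {api_key}"},
        json=payload,
        timeout=30,
    )
    if not response.ok:
        # Error bodies look like {"error": {"message": ..., "code": ...}}
        err = response.json().get("error", {})
        raise RuntimeError(
            f"HTTP {response.status_code} {err.get('code')}: {err.get('message')}"
        )
    return response.json()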

cURL Example

curl https://api.llmtune.io/v1/models/meta-llama/Llama-3.3-70B-Instruct/inference \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Summarize LLMTune in one sentence.",
    "temperature": 0.7,
    "maxTokens": 200
  }'

JavaScript (Fetch)

const response = await fetch(
  'https://api.llmtune.io/v1/models/meta-llama/Llama-3.3-70B-Instruct/inference',
  {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.LLMTUNE_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      prompt: 'Summarize LLMTune in one sentence.',
      temperature: 0.7,
      maxTokens: 200
    })
  }
);

if (!response.ok) {
  // Surface the error envelope described above
  const { error } = await response.json();
  throw new Error(`${response.status} ${error.code}: ${error.message}`);
}

const data = await response.json();
console.log(data.text);

Python (requests)

import os
import requests

api_key = os.environ["LLMTUNE_API_KEY"]
model_id = "meta-llama/Llama-3.3-70B-Instruct"

payload = {
    "prompt": "Summarize LLMTune in one sentence.",
    "temperature": 0.7,
    "maxTokens": 200
}

response = requests.post(
    f"https://api.llmtune.io/v1/models/{model_id}/inference",
    headers={
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    },
    json=payload,
    timeout=30
)

response.raise_for_status()
print(response.json()["text"])

Playground Inference

For quick smoke tests, use the playground endpoint:
curl https://api.llmtune.io/v1/playground/inference \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "modelId": "meta-llama/Llama-3.3-70B-Instruct",
    "prompt": "Hello, world!",
    "temperature": 0.7,
    "maxTokens": 800
  }'

Batch Inference

Submit up to 100 inference jobs per call:
curl https://api.llmtune.io/v1/batch/inference \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "modelId": "meta-llama/Llama-3.3-70B-Instruct",
    "requests": [
      { "id": "req-1", "prompt": "First prompt" },
      { "id": "req-2", "prompt": "Second prompt" }
    ],
    "webhookUrl": "https://app.yourdomain.com/batch-callback"
  }'
Response:
{
  "batchId": "batch-uuid",
  "status": "queued",
  "summary": { "total": 2, "accepted": 2 }
}
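
The same submission in Python, as a sketch (the endpoint and fields mirror the cURL call above; completed results are delivered to the webhookUrl):

import os
import requests

api_key = os.environ["LLMTUNE_API_KEY"]

batch = {
    "modelId": "meta-llama/Llama-3.3-70B-Instruct",
    "requests": [
        {"id": "req-1", "prompt": "First prompt"},
        {"id": "req-2", "prompt": "Second prompt"},
    ],
    # Called back when the batch completes (see webhookUrl above)
    "webhookUrl": "https://app.yourdomain.com/batch-callback",
}

response = requests.post(
    "https://api.llmtune.io/v1/batch/inference",
    headers={"Authorization": f"Bearer {api_key}"},
    json=batch,
    timeout=30,
)
response.raise_for_status()
job = response.json()
print(job["batchId"], job["status"])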

Error Handling

Common error codes:
  • 401 Unauthorized – Invalid or missing API key
  • 402 Payment Required – Insufficient credits
  • 404 Not Found – Model or job ID not found
  • 429 Too Many Requests – Rate limit exceeded (check the Retry-After header)
  • 500 Internal Server Error – Unexpected issue; retry with exponential backoff (see the sketch below)
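
A minimal retry sketch: it honors Retry-After on 429 and backs off exponentially on 5xx responses. The function name and retry budget are illustrative, not part of the API:

import time
import requests

def inference_with_retry(url, api_key, payload, max_attempts=5):
    """POST with retries: honor Retry-After on 429, back off on 5xx."""
    headers = {"Authorization": f"Bearer {api_key}"}
    for attempt in range(max_attempts):
        response = requests.post(url, headers=headers, json=payload, timeout=30)
        if response.status_code == 429:
            # Prefer the server's Retry-After hint when present
            delay = float(response.headers.get("Retry-After", 2 ** attempt))
        elif response.status_code >= 500:
            delay = 2 ** attempt  # exponential backoff: 1s, 2s, 4s, ...
        else:
            response.raise_for_status()  # other 4xx errors are not retried
            return response.json()
        time.sleep(delay)
    raise RuntimeError(f"Giving up after {max_attempts} attempts")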

Rate Limits

Rate limits vary by plan:
  • Sandbox – Lower limits for experimentation
  • Growth / Production – Higher limits for production traffic
  • Enterprise – Custom limits and SLAs
Rate limit headers are included in responses:
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 999
X-RateLimit-Reset: 1640995200
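
A client can use these headers to pause before exhausting its quota. A sketch, assuming X-RateLimit-Reset is a Unix timestamp (as the sample value suggests):

import time
import requests

def throttled_post(url: str, headers: dict, payload: dict) -> dict:
    """POST, then sleep until the window resets if the quota is used up."""
    response = requests.post(url, headers=headers, json=payload, timeout=30)
    remaining = int(response.headers.get("X-RateLimit-Remaining", "1"))
    reset_at = int(response.headers.get("X-RateLimit-Reset", "0"))
    if remaining == 0:
        # Assumes a Unix-epoch reset timestamp, per the sample above
        time.sleep(max(0.0, reset_at - time.time()))
    response.raise_for_status()
    return response.json()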

Next Steps