Deployment Guide

Deploying a fine-tuned model makes it accessible via REST endpoints.

Promote a Run

  1. After a training run completes in FineTune Studio, navigate to LLMTune Deploy (or click Promote to Endpoint from the training job).
  2. Choose environment:
    • Staging – For testing before production
    • Production – For live traffic
  3. Provide a deployment name and optional description.
  4. Configure deployment settings:
    • Version tagging
    • Traffic routing (if deploying multiple versions)
    • Autoscaling parameters (min/max replicas)
    • Timeout settings
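The settings in step 4 can be pictured as a single configuration object. This is only an illustrative sketch; the field names (`environment`, `autoscaling`, `timeoutSeconds`, etc.) are assumptions, not LLMTune Deploy's documented schema:

```python
# Hypothetical deployment configuration mirroring steps 2-4 above.
# All key names are illustrative; configure the real values in the Deploy UI.
deployment = {
    "name": "support-bot-ft",                        # deployment name (step 3)
    "description": "Fine-tuned on Q3 support data",  # optional description
    "environment": "staging",                        # "staging" or "production"
    "version": "v1",                                 # version tag
    "autoscaling": {"minReplicas": 1, "maxReplicas": 4},
    "timeoutSeconds": 30,
}

# Sanity-check the sketch.
assert deployment["environment"] in ("staging", "production")
assert (deployment["autoscaling"]["minReplicas"]
        <= deployment["autoscaling"]["maxReplicas"])
```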

Endpoint Configuration

Each deployed model exposes:
  • Inference URL: https://api.llmtune.io/v1/models/{modelId}/inference — use the deployed model ID (or the ID shown in the deployment panel).
  • Supported modes:
    • Single prompt inference (POST with prompt, temperature, maxTokens)
    • Chat completions (OpenAI-compatible) via /chat/completions with the same model ID
    • Batch inference via /batch/inference with modelId
  • Rate limits: Vary by plan — see the Rate limits page for current values.
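A single-prompt inference request might be assembled like this. The URL and body fields (`prompt`, `temperature`, `maxTokens`) come from the list above; the bearer-token auth header and placeholder credentials are assumptions for the sketch:

```python
import json

# Placeholders -- substitute your deployed model ID and API key.
MODEL_ID = "your-model-id"
API_KEY = "your-api-key"

# Inference URL documented above, with the model ID filled in.
url = f"https://api.llmtune.io/v1/models/{MODEL_ID}/inference"

headers = {
    "Authorization": f"Bearer {API_KEY}",  # assumed auth scheme
    "Content-Type": "application/json",
}

# Single-prompt inference payload (fields documented above).
payload = {
    "prompt": "Summarize our refund policy in one sentence.",
    "temperature": 0.2,
    "maxTokens": 128,
}
body = json.dumps(payload)

# To send it, POST `body` with `headers` to `url` using your HTTP client,
# e.g. requests.post(url, headers=headers, data=body).
```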

Managing Versions

LLMTune Deploy supports full version control:
  • Deploy multiple versions simultaneously (v1, v2, etc.)
  • Mark one as default for production traffic
  • Track changes with notes, approvers, and automated rollback states
  • Retire older versions when no longer needed

Traffic Management

Deploy supports advanced traffic routing:
  • Canary deployments – Gradually shift traffic to new versions
  • Shadow deployments – Test new versions without affecting production
  • Blue/Green deployments – Instant switch between versions
  • Traffic splitting – Route percentage of traffic to different versions
All traffic management happens without touching infrastructure – configure it directly in the Deploy interface.
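As a concrete picture of a canary rollout, the staged traffic splits might look like the plan below. The stages and percentages are made up for illustration; you configure the actual splits in the Deploy interface:

```python
# Hypothetical canary plan: shift traffic from v1 to v2 in stages,
# pausing at each stage to watch metrics before continuing.
canary_stages = [
    {"v1": 95, "v2": 5},    # start with a small slice of traffic
    {"v1": 75, "v2": 25},
    {"v1": 50, "v2": 50},
    {"v1": 0,  "v2": 100},  # full cutover
]

# Every stage must route 100% of traffic somewhere.
for stage in canary_stages:
    assert sum(stage.values()) == 100
```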

Rollback

To roll back to a previous version:
  1. Open the deployment in LLMTune Deploy.
  2. Select the version you want to roll back to.
  3. Click Rollback or Promote to Production.
  4. The deployment switches instantly – no downtime.
LLMTune tracks all rollback events in the change log for audit purposes.

Runtime Observability

Monitor your deployments in real time:
  • Latency metrics – Track response times and P95/P99 percentiles
  • Spend tracking – Monitor costs per version and time period
  • Error rates – Track failures and anomalies
  • Usage intelligence – Tie metrics to each release for PM and ops alignment
Set up alerting hooks to notify your team when thresholds are breached.
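The threshold logic behind such an alerting hook can be sketched in a few lines. The metric names and limits here are invented for the example; use whatever metrics and thresholds make sense for your deployment:

```python
# Illustrative alert check: report which metrics breach their thresholds.
thresholds = {"p95_latency_ms": 800, "error_rate": 0.02}

def breached(metrics: dict, limits: dict) -> list:
    """Return the names of metrics that exceed their configured limits."""
    return [name for name, limit in limits.items()
            if metrics.get(name, 0) > limit]

# Example reading shortly after a deployment.
current = {"p95_latency_ms": 950, "error_rate": 0.001}
print(breached(current, thresholds))  # -> ['p95_latency_ms']
```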

Ops Automation

LLMTune Deploy integrates with your ops workflows:
  • Smoke tests – Automatically run tests after deployment
  • Observability dashboards – Connect to your existing monitoring tools
  • Incident workflows – Trigger alerts and notifications
  • Webhooks – Receive deployment lifecycle events
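A webhook receiver for lifecycle events would parse each delivery and branch on the event type. The event schema below (`event`, `deploymentId`, `version`, `environment`) is assumed for illustration, not LLMTune's documented payload format:

```python
import json

# Example webhook delivery body (schema is hypothetical).
raw = json.dumps({
    "event": "deployment.promoted",
    "deploymentId": "dep-123",
    "version": "v2",
    "environment": "production",
})

def handle_event(body: str) -> str:
    """Parse a lifecycle event and return a human-readable summary."""
    event = json.loads(body)
    if event["event"] == "deployment.promoted":
        return (f"{event['deploymentId']} promoted "
                f"{event['version']} to {event['environment']}")
    return "ignored"

print(handle_event(raw))  # -> dep-123 promoted v2 to production
```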

Best Practices

  1. Start with staging – Always test in staging before promoting to production
  2. Use version control – Tag and document each deployment version
  3. Monitor closely – Watch metrics for the first few minutes after deployment
  4. Plan rollbacks – Know which version to rollback to before deploying
  5. Use traffic management – Gradually roll out changes with canary deployments

Troubleshooting

  • Deployment fails: Check that the training job completed successfully and the model is accessible
  • High latency: Review model size and consider using a smaller model or optimizing inference
  • Errors in production: Use the rollback feature immediately, then investigate in staging
  • Traffic routing issues: Verify traffic split configuration and check version status

Next Steps