Deployment Guide
Deploying a fine-tuned model makes it accessible via REST endpoints.
Promote a Run
- After a training run completes in FineTune Studio, navigate to LLMTune Deploy (or click Promote to Endpoint from the training job).
- Choose environment:
- Staging – For testing before production
- Production – For live traffic
- Provide a deployment name and optional description.
- Configure deployment settings:
- Version tagging
- Traffic routing (if deploying multiple versions)
- Autoscaling parameters (min/max replicas)
- Timeout settings
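The settings above can be captured in a deployment payload. A minimal sketch follows; the field names mirror the settings listed here but are assumptions, not confirmed API identifiers:

```python
import json

# Hypothetical deployment configuration; field names are illustrative
# and mirror the settings above, not a documented LLMTune schema.
deployment = {
    "name": "support-bot-staging",
    "description": "Fine-tuned support model, first staging rollout",
    "environment": "staging",            # Staging or Production
    "version": "v1",                     # version tag
    "autoscaling": {"minReplicas": 1, "maxReplicas": 4},
    "timeoutSeconds": 30,
}

body = json.dumps(deployment)
print(body)
```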
Endpoint Configuration
Each deployed model exposes:
- Inference URL: `https://api.llmtune.io/v1/models/{modelId}/inference` (use the deployed model ID, or the ID shown in the deployment panel)
- Supported modes:
  - Single prompt inference (`POST` with `prompt`, `temperature`, `maxTokens`)
  - Chat completions (OpenAI-compatible) via `/chat/completions` with the same model ID
  - Batch inference via `/batch/inference` with `modelId`
- Rate limits: See Rate limits; limits may vary by plan.
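A single prompt inference call can be sketched as follows. The URL shape and the `prompt`/`temperature`/`maxTokens` fields come from the list above; the bearer-token auth header is an assumption:

```python
import json
import urllib.request

# Sketch of a single-prompt inference request. MODEL_ID is a
# placeholder, and the Authorization scheme is assumed, not documented.
MODEL_ID = "your-model-id"
url = f"https://api.llmtune.io/v1/models/{MODEL_ID}/inference"

payload = {
    "prompt": "Summarize our refund policy in one sentence.",
    "temperature": 0.2,
    "maxTokens": 128,
}

request = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer <LLMTUNE_API_KEY>",  # assumed auth scheme
    },
    method="POST",
)
# response = urllib.request.urlopen(request)  # uncomment with a real key
```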
Managing Versions
LLMTune Deploy supports full version control:
- Deploy multiple versions simultaneously (v1, v2, etc.)
- Mark one as default for production traffic
- Track changes with notes, approvers, and automated rollback states
- Retire older versions when no longer needed
Traffic Management
Deploy supports advanced traffic routing:
- Canary deployments – Gradually shift traffic to new versions
- Shadow deployments – Test new versions without affecting production
- Blue/Green deployments – Instant switch between versions
- Traffic splitting – Route percentage of traffic to different versions
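Percentage-based traffic splitting works like weighted routing: each request is assigned to a version according to the configured weights. A minimal illustration (the 90/10 split is an example, not a platform default):

```python
import random

# Illustrative traffic split between a stable version and a canary.
TRAFFIC_SPLIT = {"v1": 0.9, "v2": 0.1}  # 90% stable, 10% canary

def route_request(rng: random.Random) -> str:
    """Pick a version for one request according to TRAFFIC_SPLIT."""
    roll = rng.random()
    cumulative = 0.0
    for version, weight in TRAFFIC_SPLIT.items():
        cumulative += weight
        if roll < cumulative:
            return version
    return list(TRAFFIC_SPLIT)[-1]  # guard against float rounding

rng = random.Random(42)
sample = [route_request(rng) for _ in range(1000)]
print(sample.count("v2"))  # roughly 100 of 1000 requests hit the canary
```

Gradually raising the canary weight toward 1.0 is the essence of a canary rollout; setting a weight to 0 while still logging routed requests approximates a shadow deployment.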
Rollback
To roll back to a previous version:
- Open the deployment in LLMTune Deploy.
- Select the version you want to roll back to.
- Click Rollback or Promote to Production.
- The deployment switches instantly – no downtime.
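The same switch could plausibly be scripted by marking a previous version as the default. This is a purely hypothetical sketch; the endpoint path and field names are assumptions, not a documented LLMTune API:

```python
import json

# Hypothetical programmatic rollback: point the deployment's default
# at a known-good version. Path and fields are NOT documented API.
deployment_id = "dep-123"   # placeholder deployment ID
target_version = "v1"       # the known-good version to roll back to

url = f"https://api.llmtune.io/v1/deployments/{deployment_id}/default-version"
body = json.dumps({"version": target_version})
print(url, body)
```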
Runtime Observability
Monitor your deployments in real time:
- Latency metrics – Track response times and P95/P99 percentiles
- Spend tracking – Monitor costs per version and time period
- Error rates – Track failures and anomalies
- Usage intelligence – Tie metrics to each release for PM and ops alignment
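The P95/P99 latency percentiles above mean the value below which 95% (or 99%) of response times fall. A sketch using the nearest-rank method over a window of sample latencies (the sample values are illustrative):

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value >= pct% of the samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Example window of response times in milliseconds (illustrative data).
latencies_ms = [96, 98, 99, 100, 101, 102, 103, 104, 105, 106,
                108, 110, 112, 115, 118, 120, 125, 130, 480, 900]

p95 = percentile(latencies_ms, 95)  # 480: one slow request dominates P95
p99 = percentile(latencies_ms, 99)  # 900: the single worst outlier
```

Tail percentiles like these surface slow outliers that an average would hide, which is why they are the standard latency signal after a deployment.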
Ops Automation
LLMTune Deploy integrates with your ops workflows:
- Smoke tests – Automatically run tests after deployment
- Observability dashboards – Connect to your existing monitoring tools
- Incident workflows – Trigger alerts and notifications
- Webhooks – Receive deployment lifecycle events
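A webhook consumer for lifecycle events might look like the sketch below. The event fields and the HMAC-SHA256 signature scheme are assumptions; check your webhook settings for the actual contract:

```python
import hashlib
import hmac
import json

# Assumed shared secret and signature scheme (HMAC-SHA256 over the raw
# body); the real contract may differ.
SECRET = b"whsec_example"

def handle_event(raw_body: bytes, signature: str) -> str:
    """Verify the signature, then summarize the deployment event."""
    expected = hmac.new(SECRET, raw_body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        raise ValueError("signature mismatch")
    event = json.loads(raw_body)
    # Assumed event fields, for illustration only.
    return f"{event['type']}: deployment {event['deploymentId']}"

# Simulate an incoming event for demonstration.
body = json.dumps({"type": "deployment.succeeded",
                   "deploymentId": "dep-123"}).encode("utf-8")
sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
print(handle_event(body, sig))
```

Verifying the signature before parsing keeps a forged request from triggering your incident or smoke-test automation.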
Best Practices
- Start with staging – Always test in staging before promoting to production
- Use version control – Tag and document each deployment version
- Monitor closely – Watch metrics for the first few minutes after deployment
- Plan rollbacks – Know which version to roll back to before deploying
- Use traffic management – Gradually roll out changes with canary deployments
Troubleshooting
- Deployment fails: Check that the training job completed successfully and the model is accessible
- High latency: Review model size and consider using a smaller model or optimizing inference
- Errors in production: Use the rollback feature immediately, then investigate in staging
- Traffic routing issues: Verify traffic split configuration and check version status
Next Steps
- Learn about Evaluate to test deployments before promoting
- Read the Inference API Guide for integration details
- Set up Webhooks for deployment automation
- Check the API documentation for programmatic deployment management