Deployment Guide
Deploying a fine-tuned model makes it accessible via REST endpoints.
Promote a Run
- After a training run completes in FineTune Studio, navigate to LLMTune Deploy (or click Promote to Endpoint from the training job).
- Choose environment:
- Staging – For testing before production
- Production – For live traffic
- Provide a deployment name and optional description.
- Configure deployment settings:
- Version tagging
- Traffic routing (if deploying multiple versions)
- Autoscaling parameters (min/max replicas)
- Timeout settings
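The settings above can be captured in a deployment payload. A minimal sketch follows; the field names mirror the settings listed here but are assumptions, not confirmed API identifiers:

```python
import json

# Hypothetical deployment configuration; field names are illustrative
# and mirror the settings above, not a documented LLMTune schema.
deployment = {
    "name": "support-bot-staging",
    "description": "Fine-tuned support model, first staging rollout",
    "environment": "staging",            # Staging or Production
    "version": "v1",                     # version tag
    "autoscaling": {"minReplicas": 1, "maxReplicas": 4},
    "timeoutSeconds": 30,
}

body = json.dumps(deployment)
print(body)
```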
Endpoint Configuration
Each deployed model exposes:
- Inference URL: `https://api.llmtune.io/v1/models/{modelId}/inference` (use the deployed model ID, or the ID shown in the deployment panel)
- Supported modes:
  - Single prompt inference (`POST` with `prompt`, `temperature`, `maxTokens`)
  - Chat completions (OpenAI-compatible) via `/chat/completions` with the same model ID
  - Batch inference via `/batch/inference` with `modelId`
- Rate limits: See Rate limits; limits may vary by plan.
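A single prompt inference call can be sketched as follows. The URL shape and the `prompt`/`temperature`/`maxTokens` fields come from the list above; the bearer-token auth header is an assumption:

```python
import json
import urllib.request

# Sketch of a single-prompt inference request. MODEL_ID is a
# placeholder, and the Authorization scheme is assumed, not documented.
MODEL_ID = "your-model-id"
url = f"https://api.llmtune.io/v1/models/{MODEL_ID}/inference"

payload = {
    "prompt": "Summarize our refund policy in one sentence.",
    "temperature": 0.2,
    "maxTokens": 128,
}

request = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer <LLMTUNE_API_KEY>",  # assumed auth scheme
    },
    method="POST",
)
# response = urllib.request.urlopen(request)  # uncomment with a real key
```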
Managing Versions
LLMTune Deploy supports full version control:
- Deploy multiple versions simultaneously (v1, v2, etc.)
- Mark one as default for production traffic
- Track changes with notes, approvers, and automated rollback states
- Retire older versions when no longer needed
Traffic Management
Deploy supports advanced traffic routing:
- Canary deployments – Gradually shift traffic to new versions
- Shadow deployments – Test new versions without affecting production
- Blue/Green deployments – Instant switch between versions
- Traffic splitting – Route percentage of traffic to different versions
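Percentage-based traffic splitting works like weighted routing: each request is assigned to a version according to the configured weights. A minimal illustration (the 90/10 split is an example, not a platform default):

```python
import random

# Illustrative traffic split between a stable version and a canary.
TRAFFIC_SPLIT = {"v1": 0.9, "v2": 0.1}  # 90% stable, 10% canary

def route_request(rng: random.Random) -> str:
    """Pick a version for one request according to TRAFFIC_SPLIT."""
    roll = rng.random()
    cumulative = 0.0
    for version, weight in TRAFFIC_SPLIT.items():
        cumulative += weight
        if roll < cumulative:
            return version
    return list(TRAFFIC_SPLIT)[-1]  # guard against float rounding

rng = random.Random(42)
sample = [route_request(rng) for _ in range(1000)]
print(sample.count("v2"))  # roughly 100 of 1000 requests hit the canary
```

Gradually raising the canary weight toward 1.0 is the essence of a canary rollout; setting a weight to 0 while still logging routed requests approximates a shadow deployment.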
Rollback
To roll back to a previous version:
- Open the deployment in LLMTune Deploy.
- Select the version you want to roll back to.
- Click Rollback or Promote to Production.
- The deployment switches instantly – no downtime.
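The same switch could plausibly be scripted by marking a previous version as the default. This is a purely hypothetical sketch; the endpoint path and field names are assumptions, not a documented LLMTune API:

```python
import json

# Hypothetical programmatic rollback: point the deployment's default
# at a known-good version. Path and fields are NOT documented API.
deployment_id = "dep-123"   # placeholder deployment ID
target_version = "v1"       # the known-good version to roll back to

url = f"https://api.llmtune.io/v1/deployments/{deployment_id}/default-version"
body = json.dumps({"version": target_version})
print(url, body)
```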
Runtime Observability
Monitor your deployments in real time:
- Latency metrics – Track response times and P95/P99 percentiles
- Spend tracking – Monitor costs per version and time period
- Error rates – Track failures and anomalies
- Usage intelligence – Tie metrics to each release for PM and ops alignment
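The P95/P99 latency percentiles above mean the value below which 95% (or 99%) of response times fall. A sketch using the nearest-rank method over a window of sample latencies (the sample values are illustrative):

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value >= pct% of the samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Example window of response times in milliseconds (illustrative data).
latencies_ms = [96, 98, 99, 100, 101, 102, 103, 104, 105, 106,
                108, 110, 112, 115, 118, 120, 125, 130, 480, 900]

p95 = percentile(latencies_ms, 95)  # 480: one slow request dominates P95
p99 = percentile(latencies_ms, 99)  # 900: the single worst outlier
```

Tail percentiles like these surface slow outliers that an average would hide, which is why they are the standard latency signal after a deployment.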
Ops Automation
LLMTune Deploy integrates with your ops workflows:
- Smoke tests – Automatically run tests after deployment
- Observability dashboards – Connect to your existing monitoring tools
- Incident workflows – Trigger alerts and notifications
- Webhooks – Receive deployment lifecycle events
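A webhook consumer for lifecycle events might look like the sketch below. The event fields and the HMAC-SHA256 signature scheme are assumptions; check your webhook settings for the actual contract:

```python
import hashlib
import hmac
import json

# Assumed shared secret and signature scheme (HMAC-SHA256 over the raw
# body); the real contract may differ.
SECRET = b"whsec_example"

def handle_event(raw_body: bytes, signature: str) -> str:
    """Verify the signature, then summarize the deployment event."""
    expected = hmac.new(SECRET, raw_body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        raise ValueError("signature mismatch")
    event = json.loads(raw_body)
    # Assumed event fields, for illustration only.
    return f"{event['type']}: deployment {event['deploymentId']}"

# Simulate an incoming event for demonstration.
body = json.dumps({"type": "deployment.succeeded",
                   "deploymentId": "dep-123"}).encode("utf-8")
sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
print(handle_event(body, sig))
```

Verifying the signature before parsing keeps a forged request from triggering your incident or smoke-test automation.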
Best Practices
- Start with staging – Always test in staging before promoting to production
- Use version control – Tag and document each deployment version
- Monitor closely – Watch metrics for the first few minutes after deployment
- Plan rollbacks – Know which version to roll back to before deploying
- Use traffic management – Gradually roll out changes with canary deployments
Troubleshooting
- Deployment fails: Check that the training job completed successfully and the model is accessible
- High latency: Review model size and consider using a smaller model or optimizing inference
- Errors in production: Use the rollback feature immediately, then investigate in staging
- Traffic routing issues: Verify traffic split configuration and check version status
Next Steps
- Learn about Evaluate to test deployments before promoting
- Read the Inference API Guide for integration details
- Set up Webhooks for deployment automation
- Check the API documentation for programmatic deployment management