Troubleshooting Guide
This guide covers common issues you may encounter while using LLMTune and how to resolve them. If you don’t find what you’re looking for, contact support@llmtune.io.Table of Contents
- FineTune Studio Issues
- Deployment Issues
- API & Inference Issues
- Dataset Issues
- Evaluation Issues
- Account & Billing Issues
- Getting Help
FineTune Studio Issues
Training Job Stuck in “Pending” or “Queued”
Symptoms:- Job shows as “Pending” or “Queued” for an extended period
- Queue position is not changing
- No GPU allocation events in the logs
- High demand on compute resources
- Previous job is still running (training queue processes jobs sequentially)
- Insufficient GPU availability
- Check your queue position in the training job details
- Wait for the current job to complete - jobs are processed sequentially to conserve GPU resources
- If waiting longer than expected, check the dashboard for estimated wait times
- Contact support if queue time exceeds the estimated time by more than 30 minutes
Training Job Failed
Symptoms:- Job status shows “Failed” or “Error”
- Error message displayed in the job details
- No checkpoint or model artifact generated
“Dataset validation failed”
Cause: Dataset format doesn’t match the training method requirements. Solution:- Check dataset format in your data library
- Ensure JSONL files have the correct structure (e.g., messages or conversations arrays for SFT/DPO)
- Verify the dataset matches the training method requirements (see Fine-Tuning Guide)
- Use a playground dataset to test if the issue is with your data
”Out of memory” or “CUDA OOM”
Cause: Model or batch size too large for available GPU memory. Solution:- Reduce batch size in training configuration
- Use a smaller base model
- Switch to GPU Cluster for more memory
- Use QLoRA or LoRA for parameter-efficient training
”Learning rate too high” or loss spikes
Cause: Learning rate is too aggressive for the model/dataset. Solution:- Reduce learning rate (try 0.0001 or lower for SFT)
- Check dataset quality and balance
- Enable gradient clipping if available
- Review dataset in your data library for issues
”Connection timeout” or “Network error”
Cause: Network connectivity issues or compute provider outage. Solution:- Check your internet connection
- Try refreshing the page
- If using Federated compute, switch to Traditional compute temporarily
- Contact support if the issue persists
Loss Not Decreasing
Symptoms:- Loss curve remains flat or increases during training
- Model outputs are worse than the base model
- Check dataset quality: Review quality scores in your data library
- Reduce learning rate: Lower values may help convergence
- Increase epochs: Training may need more iterations
- Verify dataset format: Ensure correct structure for your training method
- Try a different base model: Some models may be more suitable for your task
- Use playground dataset: Test with a known-good dataset first
Can’t Find My Model in the Catalog
Symptoms:- Expected model doesn’t appear in FineTune Studio
- Model shows as “unavailable”
- Check if the model is supported (see Model Configuration)
- Filter by provider or modality
- Some models may be temporarily unavailable due to compute constraints
- Contact support for model availability requests
Deployment Issues
Deployment Failed to Create
Symptoms:- Error when promoting a model to an endpoint
- Deployment shows as “Failed” or “Error”
“Training job not completed”
Cause: Trying to deploy a training run that hasn’t finished. Solution:- Wait for the training job to complete successfully
- Check the job status in FineTune Studio
- Only deploy jobs with “Completed” status
”Model artifact not found”
Cause: The model checkpoint wasn’t saved properly during training. Solution:- Check if the training job completed successfully
- Review training logs for errors during checkpoint saving
- Contact support if the artifact appears to be missing
”Insufficient balance”
Cause: Not enough credits to deploy the model. Solution:- Check your balance in the Usage dashboard
- Add credits via Stripe integration
- Consider deploying a smaller model to reduce costs
Endpoint Returns Errors
Symptoms:- API calls to the endpoint fail
- 4xx or 5xx errors returned
“404 Not Found” for endpoint
Cause: Deployment ID is incorrect or endpoint was deleted. Solution:- Verify the deployment ID in the Deploy dashboard
- Check if the deployment is still active
- Ensure you’re using the correct base URL: https://api.llmtune.io/v1
”503 Service Unavailable” or high latency
Cause: Endpoint is overloaded or scaling up. Solution:- Check if the deployment is in “healthy” state
- Review runtime observability metrics
- Consider increasing min replicas in deployment settings
- Implement retry logic in your client
”Model loading” timeout
Cause: Model is taking too long to load. Solution:- Check deployment status - it may still be initializing
- Larger models take longer to load
- If the issue persists, contact support
Can’t Rollback Deployment
Symptoms:- Rollback button is disabled or fails
- Previous version not available
- Ensure the previous version still exists
- Check if you have sufficient permissions
- Review deployment history to see available versions
- Contact support if the rollback is critical
API & Inference Issues
”401 Unauthorized” Error
Symptoms:- API requests fail with authentication errors
- “Invalid API key” message
- Verify the API key is correct
- Check if the key is still active (not revoked)
- Ensure the Authorization header format is correct: Bearer YOUR_API_KEY
- Create a new API key if needed
”402 Payment Required” Error
Symptoms:- API requests fail with payment errors
- Insufficient credits message
- Check your balance in the Usage dashboard
- Add credits via Stripe integration
- Review your usage to understand costs
- Set up automatic top-ups to avoid interruptions
”404 Not Found” for Model
Symptoms:- Model endpoint returns 404
- “Model not found” error
- Verify the model ID is correct
- Check if the deployment is still active
- Ensure you’re using the correct endpoint URL
- Check the Deploy dashboard for the correct deployment ID
”429 Rate Limited” Error
Symptoms:- API requests fail with rate limit errors
- Retry-After header in response
- Implement exponential backoff in your client
- Check the Retry-After header for wait time
- Review your usage patterns
- Consider upgrading your plan for higher limits
Slow Response Times
Symptoms:- API calls take longer than expected
- High latency in responses
- Check deployment runtime observability metrics
- Review model size - larger models have higher latency
- Consider using a smaller, faster model
- Implement caching for repeated requests
- Check if the endpoint is scaling properly
Unexpected Model Outputs
Symptoms:- Model returns incorrect or unexpected responses
- Outputs don’t match fine-tuning data
- Verify the correct deployment is being used
- Check if the model was fine-tuned with appropriate data
- Review evaluation results to understand model behavior
- Consider retraining with improved dataset
- Adjust inference parameters (temperature, top_p)
Dataset Issues
Dataset Upload Fails
Symptoms:- Upload gets stuck or fails
- “Invalid file format” error
- Ensure file format is supported (JSONL, CSV, TXT)
- Check file size limits
- Verify file encoding is UTF-8
- For JSONL, ensure each line is valid JSON
- Try uploading a smaller sample file first
Dataset Validation Errors
Symptoms:- Dataset shows validation errors in your data library
- “Missing required fields” error
- Review the validation error details
- Check that JSONL has required fields (messages or conversations)
- Verify all JSON objects are properly formatted
- Check for special characters that need escaping
- Use the playground dataset format as reference
PII Detection Issues
Symptoms:- Too many or too few PII detections
- False positives/negatives
- Review the PII detection report
- Adjust PII detection sensitivity if available
- Manually review and flag false positives
- Ensure data doesn’t contain actual PII if not intended
- Contact support if detection seems incorrect
Low Quality Score
Symptoms:- Dataset receives low quality score
- Warnings about data quality
- Review quality report details
- Check for duplicate entries
- Ensure consistent formatting across entries
- Verify conversation turns are properly labeled
- Add more diverse examples if coverage is low
- Clean data before upload (remove empty entries, fix formatting)
Evaluation Issues
Evaluation Fails to Run
Symptoms:- Evaluation button doesn’t respond
- “Evaluation failed” error
- Ensure the training job completed successfully
- Check if the model is accessible
- Verify the deployment is active and healthy
- Try a simpler evaluation first (single prompt)
- Check browser console for errors
Comparison Not Working
Symptoms:- Can’t compare with base model
- Base model unavailable
- Ensure the base model is available in the system
- Check if the base model is compatible with your fine-tuned model
- Try using the playground to test the base model
- Contact support if base model should be available
Batch Evaluation Slow
Symptoms:- Batch evaluation takes a long time
- Progress seems stuck
- Batch evaluation processes prompts sequentially
- Check the number of prompts in your batch
- Reduce batch size for faster results
- Ensure deployment can handle the load
- Monitor progress in the evaluation interface
Unexpected Evaluation Results
Symptoms:- Metrics don’t match expectations
- Quality scores seem incorrect
- Verify evaluation prompts are appropriate
- Check if inference parameters are suitable
- Review sample outputs to understand behavior
- Compare with base model for context
- Consider adjusting evaluation criteria
Account & Billing Issues
Can’t Create API Key
Symptoms:- API key creation fails
- “Limit reached” error
- Check how many keys you’ve created
- Delete unused keys if limit reached
- Contact support for key limit increase
- Ensure you have admin permissions
Billing Questions
Symptoms:- Unexpected charges
- Need to understand billing
- Review Usage dashboard for breakdown
- Check billing reports in the dashboard
- Understand training vs inference costs
- Contact support for billing disputes
Can’t Add Credits
Symptoms:- Stripe integration fails
- Payment processing error
- Check if Stripe is enabled in your region
- Verify payment method details
- Try a different payment method
- Contact support if issue persists
Workspace Access Issues
Symptoms:- Can’t access workspace
- “Not authorized” error
- Verify you’re logged into the correct account
- Check if you’ve been invited to the workspace
- Contact workspace admin for access
- Ensure your account is verified
Getting Help
If you can’t resolve your issue with the solutions above, here are additional ways to get help:Contact Support
- Email: support@llmtune.io
- In-app support: Use the support panel in the dashboard
- Response time: Typically within 24 hours for paid plans
When Contacting Support
Include the following information to help us resolve your issue faster:- Workspace ID - Found in workspace settings
- Job/Deployment ID - For training or deployment issues
- Error message - Copy the exact error text
- Timestamps - When did the issue occur?
- Steps to reproduce - What were you doing when it happened?
- Browser/Client info - Browser version or SDK version
Self-Service Resources
- Documentation: Browse the full LLMTune Docs
- FAQ: Check the Frequently Asked Questions
- Roadmap: See what’s coming in the Roadmap
- Community: Join our GitHub Discussions
Status Page
Check the LLMTune Status Page for:- Current system status
- Scheduled maintenance
- Known incidents and outages
Escalation
For urgent production issues affecting your business:- Mark your support ticket as “Urgent”
- Include business impact details
- Enterprise customers have access to priority support
Before contacting support, try reproducing the issue in a different browser or environment. This helps identify if the issue is specific to your setup.