Troubleshooting Guide
This guide covers common issues you may encounter while using LLMTune and how to resolve them. If you don’t find what you’re looking for, contact support@llmtune.io.
Table of Contents
- FineTune Studio Issues
- Deployment Issues
- API & Inference Issues
- Dataset Issues
- Evaluation Issues
- Account & Billing Issues
- Getting Help
FineTune Studio Issues
Training Job Stuck in “Pending” or “Queued”
Symptoms:
- Job shows as “Pending” or “Queued” for an extended period
- Queue position is not changing
- No GPU allocation events in the logs
Possible Causes:
- High demand on compute resources
- Previous job is still running (the training queue processes jobs sequentially)
- Insufficient GPU availability
Solutions:
- Check your queue position in the training job details
- Wait for the current job to complete - jobs are processed sequentially to conserve GPU resources
- If waiting longer than expected, check the dashboard for estimated wait times
- Contact support if queue time exceeds the estimate by more than 30 minutes
Training Job Failed
Symptoms:
- Job status shows “Failed” or “Error”
- Error message displayed in the job details
- No checkpoint or model artifact generated
“Dataset validation failed”
Cause: The dataset format doesn’t match the training method requirements.
Solution:
- Check the dataset format in your data library
- Ensure JSONL files have the correct structure (e.g., messages or conversations arrays for SFT/DPO)
- Verify the dataset matches the training method requirements (see Fine-Tuning Guide)
- Use a playground dataset to test if the issue is with your data
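For reference, a minimal SFT record with a messages array might look like the sketch below. The exact schema is an assumption here; check the Fine-Tuning Guide for the authoritative format for your training method.

```python
import json

# A typical SFT record using a "messages" array (schema assumed;
# see the Fine-Tuning Guide for the authoritative structure).
record = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is LoRA?"},
        {"role": "assistant", "content": "LoRA is a parameter-efficient fine-tuning method."},
    ]
}

# Each JSONL line is exactly one record serialized as a single JSON object.
line = json.dumps(record, ensure_ascii=False)
print(line)
```

One record per line, with no trailing commas or multi-line objects, is what most JSONL validators expect.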
“Out of memory” or “CUDA OOM”
Cause: The model or batch size is too large for the available GPU memory.
Solution:
- Reduce the batch size in the training configuration
- Use a smaller base model
- Switch to GPU Cluster for more memory
- Use QLoRA or LoRA for parameter-efficient training
“Learning rate too high” or loss spikes
Cause: The learning rate is too aggressive for the model/dataset.
Solution:
- Reduce the learning rate (try 0.0001 or lower for SFT)
- Check dataset quality and balance
- Enable gradient clipping if available
- Review dataset in your data library for issues
“Connection timeout” or “Network error”
Cause: Network connectivity issues or a compute provider outage.
Solution:
- Check your internet connection
- Try refreshing the page
- If using Federated compute, switch to Traditional compute temporarily
- Contact support if the issue persists
Loss Not Decreasing
Symptoms:
- Loss curve remains flat or increases during training
- Model outputs are worse than the base model
Solutions:
- Check dataset quality: Review quality scores in your data library
- Reduce learning rate: Lower values may help convergence
- Increase epochs: Training may need more iterations
- Verify dataset format: Ensure correct structure for your training method
- Try a different base model: Some models may be more suitable for your task
- Use playground dataset: Test with a known-good dataset first
Can’t Find My Model in the Catalog
Symptoms:
- Expected model doesn’t appear in FineTune Studio
- Model shows as “unavailable”
Solutions:
- Check if the model is supported (see Model Configuration)
- Filter by provider or modality
- Some models may be temporarily unavailable due to compute constraints
- Contact support for model availability requests
Deployment Issues
Deployment Failed to Create
Symptoms:
- Error when promoting a model to an endpoint
- Deployment shows as “Failed” or “Error”
“Training job not completed”
Cause: Trying to deploy a training run that hasn’t finished.
Solution:
- Wait for the training job to complete successfully
- Check the job status in FineTune Studio
- Only deploy jobs with “Completed” status
“Model artifact not found”
Cause: The model checkpoint wasn’t saved properly during training.
Solution:
- Check if the training job completed successfully
- Review training logs for errors during checkpoint saving
- Contact support if the artifact appears to be missing
“Insufficient balance”
Cause: Not enough credits to deploy the model.
Solution:
- Check your balance in the Usage dashboard
- Add credits via Stripe integration
- Consider deploying a smaller model to reduce costs
Endpoint Returns Errors
Symptoms:
- API calls to the endpoint fail
- 4xx or 5xx errors returned
“404 Not Found” for endpoint
Cause: The deployment ID is incorrect or the endpoint was deleted.
Solution:
- Verify the deployment ID in the Deploy dashboard
- Check if the deployment is still active
- Ensure you’re using the correct base URL: https://api.llmtune.io/v1
“503 Service Unavailable” or high latency
Cause: The endpoint is overloaded or scaling up.
Solution:
- Check if the deployment is in a “healthy” state
- Review runtime observability metrics
- Consider increasing min replicas in deployment settings
- Implement retry logic in your client
“Model loading” timeout
Cause: The model is taking too long to load.
Solution:
- Check the deployment status - it may still be initializing
- Larger models take longer to load
- If the issue persists, contact support
Can’t Rollback Deployment
Symptoms:
- Rollback button is disabled or fails
- Previous version not available
Solutions:
- Ensure the previous version still exists
- Check if you have sufficient permissions
- Review deployment history to see available versions
- Contact support if the rollback is critical
API & Inference Issues
“401 Unauthorized” Error
Symptoms:
- API requests fail with authentication errors
- “Invalid API key” message
Solutions:
- Verify the API key is correct
- Check if the key is still active (not revoked)
- Ensure the Authorization header format is correct: Bearer YOUR_API_KEY
- Create a new API key if needed
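As a quick check, you can confirm the header your client actually sends before making a call. The sketch below uses only the Python standard library; the /chat/completions path is an assumption based on typical OpenAI-compatible APIs, so substitute your real endpoint path.

```python
import urllib.request

API_KEY = "YOUR_API_KEY"  # replace with a key from the dashboard

# The Authorization header must be exactly "Bearer <key>".
# The /chat/completions path is an assumption; check your endpoint docs.
req = urllib.request.Request(
    "https://api.llmtune.io/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)
print(req.get_header("Authorization"))  # → Bearer YOUR_API_KEY
```

A common mistake is sending the raw key without the `Bearer ` prefix, which also produces a 401.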
“402 Payment Required” Error
Symptoms:
- API requests fail with payment errors
- Insufficient credits message
Solutions:
- Check your balance in the Usage dashboard
- Add credits via Stripe integration
- Review your usage to understand costs
- Set up automatic top-ups to avoid interruptions
“404 Not Found” for Model
Symptoms:
- Model endpoint returns 404
- “Model not found” error
Solutions:
- Verify the model ID is correct
- Check if the deployment is still active
- Ensure you’re using the correct endpoint URL
- Check the Deploy dashboard for the correct deployment ID
“429 Rate Limited” Error
Symptoms:
- API requests fail with rate limit errors
- Retry-After header in response
Solutions:
- Implement exponential backoff in your client
- Check the Retry-After header for wait time
- Review your usage patterns
- Consider upgrading your plan for higher limits
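A backoff helper along these lines honors Retry-After when the server provides it and otherwise falls back to exponential backoff with jitter. This is a client-side sketch, not an LLMTune SDK feature.

```python
import random

def wait_time(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retrying a 429 response.

    Honors the Retry-After header value when present; otherwise uses
    capped exponential backoff with jitter to avoid thundering herds.
    """
    if retry_after is not None:
        return float(retry_after)
    delay = min(cap, base * (2 ** attempt))
    return delay * (0.5 + random.random() / 2)  # jitter in [0.5x, 1.0x)

# Example: the server responded with "Retry-After: 5"
print(wait_time(attempt=0, retry_after="5"))  # → 5.0
```

In a real client, sleep for the returned duration, then retry the request, giving up after a bounded number of attempts.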
Slow Response Times
Symptoms:
- API calls take longer than expected
- High latency in responses
Solutions:
- Check deployment runtime observability metrics
- Review model size - larger models have higher latency
- Consider using a smaller, faster model
- Implement caching for repeated requests
- Check if the endpoint is scaling properly
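For repeated, deterministic requests (e.g. temperature 0), a simple in-memory cache avoids paying latency twice for the same prompt. This is a sketch with a hypothetical `call_api` stand-in, not an LLMTune SDK feature.

```python
import hashlib
import json

_cache = {}

def cached_completion(payload, call_api):
    """Serve identical requests from memory.

    Only safe when outputs are deterministic for a given payload
    (e.g. temperature=0). `call_api` is your real request function.
    """
    key = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode("utf-8")
    ).hexdigest()
    if key not in _cache:
        _cache[key] = call_api(payload)
    return _cache[key]

# Demo with a fake API to show the second call is served from cache.
calls = []
def fake_api(payload):
    calls.append(payload)
    return {"text": "result"}

cached_completion({"prompt": "hi"}, fake_api)
cached_completion({"prompt": "hi"}, fake_api)
print(len(calls))  # → 1
```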
Unexpected Model Outputs
Symptoms:
- Model returns incorrect or unexpected responses
- Outputs don’t match the fine-tuning data
Solutions:
- Verify the correct deployment is being used
- Check if the model was fine-tuned with appropriate data
- Review evaluation results to understand model behavior
- Consider retraining with improved dataset
- Adjust inference parameters (temperature, top_p)
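For example, lowering temperature tends to make outputs more focused and repeatable. The request shape below follows the common OpenAI-style API, which is an assumption here; match it to your endpoint's actual schema.

```python
import json

# Sketch of a request body with conservative sampling parameters.
# "model", parameter names, and ranges are assumptions based on
# typical OpenAI-compatible APIs.
payload = {
    "model": "YOUR_DEPLOYMENT_ID",
    "messages": [{"role": "user", "content": "Summarize our refund policy."}],
    "temperature": 0.2,  # lower = more deterministic outputs
    "top_p": 0.9,        # nucleus sampling cutoff
}
print(json.dumps(payload, indent=2))
```

If outputs are erratic, try reducing temperature first; if they are repetitive, raise it slightly before changing top_p.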
Dataset Issues
Dataset Upload Fails
Symptoms:
- Upload gets stuck or fails
- “Invalid file format” error
Solutions:
- Ensure the file format is supported (JSONL, CSV, TXT)
- Check file size limits
- Verify file encoding is UTF-8
- For JSONL, ensure each line is valid JSON
- Try uploading a smaller sample file first
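Before re-uploading, you can sanity-check a JSONL file locally with a few lines of Python. The required `messages` key is an assumption for SFT-style datasets; adjust it to match your training method's schema.

```python
import json

def validate_jsonl(text, required_key="messages"):
    """Return (line_number, problem) pairs for a JSONL string.

    Flags empty lines, lines that are not valid JSON, and objects
    missing the required key (assumed to be "messages" for SFT).
    """
    errors = []
    for i, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            errors.append((i, "empty line"))
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((i, f"invalid JSON: {exc}"))
            continue
        if required_key not in obj:
            errors.append((i, f"missing '{required_key}' field"))
    return errors

sample = '{"messages": [{"role": "user", "content": "hi"}]}\nnot json'
print(validate_jsonl(sample))  # line 2 is reported as invalid JSON
```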
Dataset Validation Errors
Symptoms:
- Dataset shows validation errors in your data library
- “Missing required fields” error
Solutions:
- Review the validation error details
- Check that JSONL has required fields (messages or conversations)
- Verify all JSON objects are properly formatted
- Check for special characters that need escaping
- Use the playground dataset format as reference
PII Detection Issues
Symptoms:
- Too many or too few PII detections
- False positives/negatives
Solutions:
- Review the PII detection report
- Adjust PII detection sensitivity if available
- Manually review and flag false positives
- Ensure data doesn’t contain actual PII if not intended
- Contact support if detection seems incorrect
Low Quality Score
Symptoms:
- Dataset receives a low quality score
- Warnings about data quality
Solutions:
- Review the quality report details
- Check for duplicate entries
- Ensure consistent formatting across entries
- Verify conversation turns are properly labeled
- Add more diverse examples if coverage is low
- Clean data before upload (remove empty entries, fix formatting)
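A small pre-upload cleaning pass can handle the last two points above: drop empty entries and exact duplicates. This is a sketch assuming one JSON object per line (JSONL); key order is normalized so reordered duplicates are caught too.

```python
import json

def clean_dataset(lines):
    """Drop empty lines and exact-duplicate records before upload.

    Duplicates are detected on a canonical form (sorted keys), so
    records that differ only in key order are treated as identical.
    """
    seen = set()
    cleaned = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # remove empty entries
        canonical = json.dumps(json.loads(line), sort_keys=True)
        if canonical in seen:
            continue  # remove duplicates
        seen.add(canonical)
        cleaned.append(line)
    return cleaned

data = ['{"a": 1}', "", '{"a": 1}', '{"a": 2}']
print(clean_dataset(data))  # → ['{"a": 1}', '{"a": 2}']
```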
Evaluation Issues
Evaluation Fails to Run
Symptoms:
- Evaluation button doesn’t respond
- “Evaluation failed” error
Solutions:
- Ensure the training job completed successfully
- Check if the model is accessible
- Verify the deployment is active and healthy
- Try a simpler evaluation first (single prompt)
- Check browser console for errors
Comparison Not Working
Symptoms:
- Can’t compare with the base model
- Base model unavailable
Solutions:
- Ensure the base model is available in the system
- Check if the base model is compatible with your fine-tuned model
- Try using the playground to test the base model
- Contact support if the base model should be available but isn’t listed
Batch Evaluation Slow
Symptoms:
- Batch evaluation takes a long time
- Progress seems stuck
Solutions:
- Remember that batch evaluation processes prompts sequentially, so runtime scales with batch size
- Check the number of prompts in your batch
- Reduce batch size for faster results
- Ensure deployment can handle the load
- Monitor progress in the evaluation interface
Unexpected Evaluation Results
Symptoms:
- Metrics don’t match expectations
- Quality scores seem incorrect
Solutions:
- Verify the evaluation prompts are appropriate
- Check if inference parameters are suitable
- Review sample outputs to understand behavior
- Compare with base model for context
- Consider adjusting evaluation criteria
Account & Billing Issues
Can’t Create API Key
Symptoms:
- API key creation fails
- “Limit reached” error
Solutions:
- Check how many keys you’ve created
- Delete unused keys if limit reached
- Contact support for key limit increase
- Ensure you have admin permissions
Billing Questions
Symptoms:
- Unexpected charges
- Need to understand billing
Solutions:
- Review the Usage dashboard for a cost breakdown
- Check billing reports in the dashboard
- Understand training vs inference costs
- Contact support for billing disputes
Can’t Add Credits
Symptoms:
- Stripe integration fails
- Payment processing error
Solutions:
- Check if Stripe is enabled in your region
- Verify payment method details
- Try a different payment method
- Contact support if the issue persists
Workspace Access Issues
Symptoms:
- Can’t access the workspace
- “Not authorized” error
Solutions:
- Verify you’re logged into the correct account
- Check if you’ve been invited to the workspace
- Contact workspace admin for access
- Ensure your account is verified
Getting Help
If you can’t resolve your issue with the solutions above, here are additional ways to get help:
Contact Support
- Email: support@llmtune.io
- In-app support: Use the support panel in the dashboard
- Response time: Typically within 24 hours for paid plans
When Contacting Support
Include the following information to help us resolve your issue faster:
- Workspace ID - Found in workspace settings
- Job/Deployment ID - For training or deployment issues
- Error message - Copy the exact error text
- Timestamps - When did the issue occur?
- Steps to reproduce - What were you doing when it happened?
- Browser/Client info - Browser version or SDK version
Self-Service Resources
- Documentation: Browse the full LLMTune Docs
- FAQ: Check the Frequently Asked Questions
- Roadmap: See what’s coming in the Roadmap
- Community: Join our GitHub Discussions
Status Page
Check the LLMTune Status Page for:
- Current system status
- Scheduled maintenance
- Known incidents and outages
Escalation
For urgent production issues affecting your business:
- Mark your support ticket as “Urgent”
- Include business impact details
- Enterprise customers have access to priority support
Before contacting support, try reproducing the issue in a different browser or environment. This helps identify if the issue is specific to your setup.