Troubleshooting Guide

This guide covers common issues you may encounter while using LLMTune and how to resolve them. If you don’t find what you’re looking for, contact support@llmtune.io.

FineTune Studio Issues
Deployment Issues
API & Inference Issues
Dataset Issues
Evaluation Issues
Account & Billing Issues
Getting Help

FineTune Studio Issues

Training Job Stuck in “Pending” or “Queued”

Symptoms:

Job shows as “Pending” or “Queued” for an extended period
Queue position is not changing
No GPU allocation events in the logs

Causes:

High demand on compute resources
Previous job is still running (training queue processes jobs sequentially)
Insufficient GPU availability

Solutions:

Check your queue position in the training job details
Wait for the current job to complete - jobs are processed sequentially to conserve GPU resources
If waiting longer than expected, check the dashboard for estimated wait times
Contact support if queue time exceeds the estimated time by more than 30 minutes

Training Job Failed

Symptoms:

Job status shows “Failed” or “Error”
Error message displayed in the job details
No checkpoint or model artifact generated

Common Errors:

“Dataset validation failed”

Cause: Dataset format doesn’t match the training method requirements. Solution:

Check dataset format in your data library
Ensure JSONL files have the correct structure (e.g., messages or conversations arrays for SFT/DPO)
Verify the dataset matches the training method requirements (see Fine-Tuning Guide)
Use a playground dataset to test if the issue is with your data

”Out of memory” or “CUDA OOM”

Cause: Model or batch size too large for available GPU memory. Solution:

Reduce batch size in training configuration
Use a smaller base model
Switch to GPU Cluster for more memory
Use QLoRA or LoRA for parameter-efficient training

”Learning rate too high” or loss spikes

Cause: Learning rate is too aggressive for the model/dataset. Solution:

Reduce learning rate (try 0.0001 or lower for SFT)
Check dataset quality and balance
Enable gradient clipping if available
Review dataset in your data library for issues

”Connection timeout” or “Network error”

Cause: Network connectivity issues or compute provider outage. Solution:

Check your internet connection
Try refreshing the page
If using Federated compute, switch to Traditional compute temporarily
Contact support if the issue persists

Loss Not Decreasing

Symptoms:

Loss curve remains flat or increases during training
Model outputs are worse than the base model

Solutions:

Check dataset quality: Review quality scores in your data library
Reduce learning rate: Lower values may help convergence
Increase epochs: Training may need more iterations
Verify dataset format: Ensure correct structure for your training method
Try a different base model: Some models may be more suitable for your task
Use playground dataset: Test with a known-good dataset first

Can’t Find My Model in the Catalog

Symptoms:

Expected model doesn’t appear in FineTune Studio
Model shows as “unavailable”

Solutions:

Check if the model is supported (see Model Configuration)
Filter by provider or modality
Some models may be temporarily unavailable due to compute constraints
Contact support for model availability requests

Deployment Issues

Deployment Failed to Create

Symptoms:

Error when promoting a model to an endpoint
Deployment shows as “Failed” or “Error”

Common Causes:

“Training job not completed”

Cause: Trying to deploy a training run that hasn’t finished. Solution:

Wait for the training job to complete successfully
Check the job status in FineTune Studio
Only deploy jobs with “Completed” status

”Model artifact not found”

Cause: The model checkpoint wasn’t saved properly during training. Solution:

Check if the training job completed successfully
Review training logs for errors during checkpoint saving
Contact support if the artifact appears to be missing

”Insufficient balance”

Cause: Not enough credits to deploy the model. Solution:

Check your balance in the Usage dashboard
Add credits via Stripe integration
Consider deploying a smaller model to reduce costs

Endpoint Returns Errors

Symptoms:

API calls to the endpoint fail
4xx or 5xx errors returned

Common Errors:

“404 Not Found” for endpoint

Cause: Deployment ID is incorrect or endpoint was deleted. Solution:

Verify the deployment ID in the Deploy dashboard
Check if the deployment is still active
Ensure you’re using the correct base URL: https://api.llmtune.io/v1

”503 Service Unavailable” or high latency

Cause: Endpoint is overloaded or scaling up. Solution:

Check if the deployment is in “healthy” state
Review runtime observability metrics
Consider increasing min replicas in deployment settings
Implement retry logic in your client

”Model loading” timeout

Cause: Model is taking too long to load. Solution:

Check deployment status - it may still be initializing
Larger models take longer to load
If the issue persists, contact support

Can’t Rollback Deployment

Symptoms:

Rollback button is disabled or fails
Previous version not available

Solutions:

Ensure the previous version still exists
Check if you have sufficient permissions
Review deployment history to see available versions
Contact support if the rollback is critical

API & Inference Issues

”401 Unauthorized” Error

Symptoms:

API requests fail with authentication errors
“Invalid API key” message

Solutions:

Verify the API key is correct
Check if the key is still active (not revoked)
Ensure the Authorization header format is correct: Bearer YOUR_API_KEY
Create a new API key if needed

”402 Payment Required” Error

Symptoms:

API requests fail with payment errors
Insufficient credits message

Solutions:

Check your balance in the Usage dashboard
Add credits via Stripe integration
Review your usage to understand costs
Set up automatic top-ups to avoid interruptions

”404 Not Found” for Model

Symptoms:

Model endpoint returns 404
“Model not found” error

Solutions:

Verify the model ID is correct
Check if the deployment is still active
Ensure you’re using the correct endpoint URL
Check the Deploy dashboard for the correct deployment ID

”429 Rate Limited” Error

Symptoms:

API requests fail with rate limit errors
Retry-After header in response

Solutions:

Implement exponential backoff in your client
Check the Retry-After header for wait time
Review your usage patterns
Consider upgrading your plan for higher limits

Slow Response Times

Symptoms:

API calls take longer than expected
High latency in responses

Solutions:

Check deployment runtime observability metrics
Review model size - larger models have higher latency
Consider using a smaller, faster model
Implement caching for repeated requests
Check if the endpoint is scaling properly

Unexpected Model Outputs

Symptoms:

Model returns incorrect or unexpected responses
Outputs don’t match fine-tuning data

Solutions:

Verify the correct deployment is being used
Check if the model was fine-tuned with appropriate data
Review evaluation results to understand model behavior
Consider retraining with improved dataset
Adjust inference parameters (temperature, top_p)

Dataset Issues

Dataset Upload Fails

Symptoms:

Upload gets stuck or fails
“Invalid file format” error

Solutions:

Ensure file format is supported (JSONL, CSV, TXT)
Check file size limits
Verify file encoding is UTF-8
For JSONL, ensure each line is valid JSON
Try uploading a smaller sample file first

Dataset Validation Errors

Symptoms:

Dataset shows validation errors in your data library
“Missing required fields” error

Solutions:

Review the validation error details
Check that JSONL has required fields (messages or conversations)
Verify all JSON objects are properly formatted
Check for special characters that need escaping
Use the playground dataset format as reference

PII Detection Issues

Symptoms:

Too many or too few PII detections
False positives/negatives

Solutions:

Review the PII detection report
Adjust PII detection sensitivity if available
Manually review and flag false positives
Ensure data doesn’t contain actual PII if not intended
Contact support if detection seems incorrect

Low Quality Score

Symptoms:

Dataset receives low quality score
Warnings about data quality

Solutions:

Review quality report details
Check for duplicate entries
Ensure consistent formatting across entries
Verify conversation turns are properly labeled
Add more diverse examples if coverage is low
Clean data before upload (remove empty entries, fix formatting)

Evaluation Issues

Evaluation Fails to Run

Symptoms:

Evaluation button doesn’t respond
“Evaluation failed” error

Solutions:

Ensure the training job completed successfully
Check if the model is accessible
Verify the deployment is active and healthy
Try a simpler evaluation first (single prompt)
Check browser console for errors

Comparison Not Working

Symptoms:

Can’t compare with base model
Base model unavailable

Solutions:

Ensure the base model is available in the system
Check if the base model is compatible with your fine-tuned model
Try using the playground to test the base model
Contact support if base model should be available

Batch Evaluation Slow

Symptoms:

Batch evaluation takes a long time
Progress seems stuck

Solutions:

Batch evaluation processes prompts sequentially
Check the number of prompts in your batch
Reduce batch size for faster results
Ensure deployment can handle the load
Monitor progress in the evaluation interface

Unexpected Evaluation Results

Symptoms:

Metrics don’t match expectations
Quality scores seem incorrect

Solutions:

Verify evaluation prompts are appropriate
Check if inference parameters are suitable
Review sample outputs to understand behavior
Compare with base model for context
Consider adjusting evaluation criteria

Account & Billing Issues

Can’t Create API Key

Symptoms:

API key creation fails
“Limit reached” error

Solutions:

Check how many keys you’ve created
Delete unused keys if limit reached
Contact support for key limit increase
Ensure you have admin permissions

Billing Questions

Symptoms:

Unexpected charges
Need to understand billing

Solutions:

Review Usage dashboard for breakdown
Check billing reports in the dashboard
Understand training vs inference costs
Contact support for billing disputes

Can’t Add Credits

Symptoms:

Stripe integration fails
Payment processing error

Solutions:

Check if Stripe is enabled in your region
Verify payment method details
Try a different payment method
Contact support if issue persists

Workspace Access Issues

Symptoms:

Can’t access workspace
“Not authorized” error

Solutions:

Verify you’re logged into the correct account
Check if you’ve been invited to the workspace
Contact workspace admin for access
Ensure your account is verified

Getting Help

If you can’t resolve your issue with the solutions above, here are additional ways to get help:

Contact Support

Email: support@llmtune.io
In-app support: Use the support panel in the dashboard
Response time: Typically within 24 hours for paid plans

When Contacting Support

Include the following information to help us resolve your issue faster:

Workspace ID - Found in workspace settings
Job/Deployment ID - For training or deployment issues
Error message - Copy the exact error text
Timestamps - When did the issue occur?
Steps to reproduce - What were you doing when it happened?
Browser/Client info - Browser version or SDK version

Self-Service Resources

Documentation: Browse the full LLMTune Docs
FAQ: Check the Frequently Asked Questions
Roadmap: See what’s coming in the Roadmap
Community: Join our GitHub Discussions

Status Page

Check the LLMTune Status Page for:

Current system status
Scheduled maintenance
Known incidents and outages

Escalation

For urgent production issues affecting your business:

Mark your support ticket as “Urgent”
Include business impact details
Enterprise customers have access to priority support

Before contacting support, try reproducing the issue in a different browser or environment. This helps identify if the issue is specific to your setup.

Getting started

Setup

Core concepts

How-to guides

Documentation Index

​Troubleshooting Guide

​Table of Contents

​FineTune Studio Issues

​Training Job Stuck in “Pending” or “Queued”

​Training Job Failed

​“Dataset validation failed”

​”Out of memory” or “CUDA OOM”

​”Learning rate too high” or loss spikes

​”Connection timeout” or “Network error”

​Loss Not Decreasing

​Can’t Find My Model in the Catalog

​Deployment Issues

​Deployment Failed to Create

​“Training job not completed”

​”Model artifact not found”

​”Insufficient balance”

​Endpoint Returns Errors

​“404 Not Found” for endpoint

​”503 Service Unavailable” or high latency

​”Model loading” timeout

​Can’t Rollback Deployment

​API & Inference Issues

​”401 Unauthorized” Error

​”402 Payment Required” Error

​”404 Not Found” for Model

​”429 Rate Limited” Error

​Slow Response Times

​Unexpected Model Outputs

​Dataset Issues

​Dataset Upload Fails

​Dataset Validation Errors

​PII Detection Issues

​Low Quality Score

​Evaluation Issues

​Evaluation Fails to Run

​Comparison Not Working

​Batch Evaluation Slow

​Unexpected Evaluation Results

​Account & Billing Issues

​Can’t Create API Key

​Billing Questions

​Can’t Add Credits

​Workspace Access Issues

​Getting Help

​Contact Support

​When Contacting Support

​Self-Service Resources

​Status Page

​Escalation

Troubleshooting Guide

Table of Contents

FineTune Studio Issues

Training Job Stuck in “Pending” or “Queued”

Training Job Failed

“Dataset validation failed”

”Out of memory” or “CUDA OOM”

”Learning rate too high” or loss spikes

”Connection timeout” or “Network error”

Loss Not Decreasing

Can’t Find My Model in the Catalog

Deployment Issues

Deployment Failed to Create

“Training job not completed”

”Model artifact not found”

”Insufficient balance”

Endpoint Returns Errors

“404 Not Found” for endpoint

”503 Service Unavailable” or high latency

”Model loading” timeout

Can’t Rollback Deployment

API & Inference Issues

”401 Unauthorized” Error

”402 Payment Required” Error

”404 Not Found” for Model

”429 Rate Limited” Error

Slow Response Times

Unexpected Model Outputs

Dataset Issues

Dataset Upload Fails

Dataset Validation Errors

PII Detection Issues

Low Quality Score

Evaluation Issues

Evaluation Fails to Run

Comparison Not Working

Batch Evaluation Slow

Unexpected Evaluation Results

Account & Billing Issues

Can’t Create API Key

Billing Questions

Can’t Add Credits

Workspace Access Issues

Getting Help

Contact Support

When Contacting Support

Self-Service Resources

Status Page

Escalation