Troubleshooting Guide

This guide covers common issues you may encounter while using LLMTune and how to resolve them. If you don’t find what you’re looking for, contact support@llmtune.io.

Table of Contents

  • FineTune Studio Issues
  • Deployment Issues
  • API & Inference Issues
  • Dataset Issues
  • Evaluation Issues
  • Account & Billing Issues
  • Getting Help

FineTune Studio Issues

Training Job Stuck in “Pending” or “Queued”

Symptoms:
  • Job shows as “Pending” or “Queued” for an extended period
  • Queue position is not changing
  • No GPU allocation events in the logs
Causes:
  • High demand on compute resources
  • Previous job is still running (training queue processes jobs sequentially)
  • Insufficient GPU availability
Solutions:
  1. Check your queue position in the training job details
  2. Wait for the current job to complete - jobs are processed sequentially to conserve GPU resources
  3. If waiting longer than expected, check the dashboard for estimated wait times
  4. Contact support if queue time exceeds the estimated time by more than 30 minutes

Training Job Failed

Symptoms:
  • Job status shows “Failed” or “Error”
  • Error message displayed in the job details
  • No checkpoint or model artifact generated
Common Errors:

“Dataset validation failed”

Cause: Dataset format doesn’t match the training method requirements.
Solution:
  1. Check dataset format in your data library
  2. Ensure JSONL files have the correct structure (e.g., messages or conversations arrays for SFT/DPO)
  3. Verify the dataset matches the training method requirements (see Fine-Tuning Guide)
  4. Use a playground dataset to test if the issue is with your data
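To sanity-check the structure before uploading, it can help to round-trip one record through JSON. The record below is only an illustration of the general chat-style `messages` shape; the authoritative schema is in the Fine-Tuning Guide.

```python
import json

# Illustrative SFT record: a "messages" array of role/content turns.
# Field names beyond "messages"/"conversations" are assumptions, not
# LLMTune's documented schema.
record = {
    "messages": [
        {"role": "user", "content": "What is LoRA?"},
        {"role": "assistant", "content": "A parameter-efficient fine-tuning method."},
    ]
}

line = json.dumps(record)   # one JSONL line
parsed = json.loads(line)   # must round-trip cleanly
assert "messages" in parsed or "conversations" in parsed
```

If this kind of record parses but validation still fails, compare it against a playground dataset to isolate whether the problem is your data or the schema.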

“Out of memory” or “CUDA OOM”

Cause: Model or batch size too large for available GPU memory.
Solution:
  1. Reduce batch size in training configuration
  2. Use a smaller base model
  3. Switch to GPU Cluster for more memory
  4. Use QLoRA or LoRA for parameter-efficient training
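The usual fix is to trade per-step batch size for gradient accumulation so the effective batch size stays the same. The configuration below is a sketch with made-up field names (LLMTune's actual training config schema may differ):

```python
# Hypothetical training configuration; field names are illustrative only.
# Idea: shrink the per-device batch and use a parameter-efficient method
# so the model fits in GPU memory.
config = {
    "method": "lora",             # or "qlora" for quantized adapters
    "batch_size": 4,              # halve again if OOM persists
    "gradient_accumulation": 8,   # keeps the effective batch at 32
    "learning_rate": 1e-4,
}
effective_batch = config["batch_size"] * config["gradient_accumulation"]
```

Halving `batch_size` while doubling `gradient_accumulation` leaves `effective_batch` unchanged, so training dynamics stay comparable while peak memory drops.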

“Learning rate too high” or loss spikes

Cause: Learning rate is too aggressive for the model/dataset.
Solution:
  1. Reduce learning rate (try 0.0001 or lower for SFT)
  2. Check dataset quality and balance
  3. Enable gradient clipping if available
  4. Review dataset in your data library for issues

”Connection timeout” or “Network error”

Cause: Network connectivity issues or compute provider outage.
Solution:
  1. Check your internet connection
  2. Try refreshing the page
  3. If using Federated compute, switch to Traditional compute temporarily
  4. Contact support if the issue persists

Loss Not Decreasing

Symptoms:
  • Loss curve remains flat or increases during training
  • Model outputs are worse than the base model
Solutions:
  1. Check dataset quality: Review quality scores in your data library
  2. Reduce learning rate: Lower values may help convergence
  3. Increase epochs: Training may need more iterations
  4. Verify dataset format: Ensure correct structure for your training method
  5. Try a different base model: Some models may be more suitable for your task
  6. Use playground dataset: Test with a known-good dataset first

Can’t Find My Model in the Catalog

Symptoms:
  • Expected model doesn’t appear in FineTune Studio
  • Model shows as “unavailable”
Solutions:
  1. Check if the model is supported (see Model Configuration)
  2. Filter by provider or modality
  3. Some models may be temporarily unavailable due to compute constraints
  4. Contact support for model availability requests

Deployment Issues

Deployment Failed to Create

Symptoms:
  • Error when promoting a model to an endpoint
  • Deployment shows as “Failed” or “Error”
Common Causes:

“Training job not completed”

Cause: Trying to deploy a training run that hasn’t finished.
Solution:
  1. Wait for the training job to complete successfully
  2. Check the job status in FineTune Studio
  3. Only deploy jobs with “Completed” status

“Model artifact not found”

Cause: The model checkpoint wasn’t saved properly during training.
Solution:
  1. Check if the training job completed successfully
  2. Review training logs for errors during checkpoint saving
  3. Contact support if the artifact appears to be missing

“Insufficient balance”

Cause: Not enough credits to deploy the model.
Solution:
  1. Check your balance in the Usage dashboard
  2. Add credits via Stripe integration
  3. Consider deploying a smaller model to reduce costs

Endpoint Returns Errors

Symptoms:
  • API calls to the endpoint fail
  • 4xx or 5xx errors returned
Common Errors:

“404 Not Found” for endpoint

Cause: Deployment ID is incorrect or endpoint was deleted.
Solution:
  1. Verify the deployment ID in the Deploy dashboard
  2. Check if the deployment is still active
  3. Ensure you’re using the correct base URL: https://api.llmtune.io/v1

“503 Service Unavailable” or high latency

Cause: Endpoint is overloaded or scaling up.
Solution:
  1. Check if the deployment is in “healthy” state
  2. Review runtime observability metrics
  3. Consider increasing min replicas in deployment settings
  4. Implement retry logic in your client

“Model loading” timeout

Cause: Model is taking too long to load.
Solution:
  1. Check deployment status - it may still be initializing
  2. Larger models take longer to load
  3. If the issue persists, contact support

Can’t Rollback Deployment

Symptoms:
  • Rollback button is disabled or fails
  • Previous version not available
Solutions:
  1. Ensure the previous version still exists
  2. Check if you have sufficient permissions
  3. Review deployment history to see available versions
  4. Contact support if the rollback is critical

API & Inference Issues

”401 Unauthorized” Error

Symptoms:
  • API requests fail with authentication errors
  • “Invalid API key” message
Solutions:
  1. Verify the API key is correct
  2. Check if the key is still active (not revoked)
  3. Ensure the Authorization header format is correct: Bearer YOUR_API_KEY
  4. Create a new API key if needed
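A quick way to check the header shape is to build the request object without sending it. The sketch below uses Python's standard library; the `/chat/completions` path is an assumption for illustration, only the `https://api.llmtune.io/v1` base URL comes from this guide.

```python
import urllib.request

API_KEY = "YOUR_API_KEY"  # replace with a key from the dashboard

# Build (but don't send) a request to inspect the Authorization header.
# The endpoint path below is illustrative, not a documented route.
req = urllib.request.Request(
    "https://api.llmtune.io/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
)
assert req.get_header("Authorization") == "Bearer YOUR_API_KEY"
```

A common cause of 401s is sending the bare key without the `Bearer ` prefix, or an extra space or newline pasted along with the key.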

”402 Payment Required” Error

Symptoms:
  • API requests fail with payment errors
  • Insufficient credits message
Solutions:
  1. Check your balance in the Usage dashboard
  2. Add credits via Stripe integration
  3. Review your usage to understand costs
  4. Set up automatic top-ups to avoid interruptions

”404 Not Found” for Model

Symptoms:
  • Model endpoint returns 404
  • “Model not found” error
Solutions:
  1. Verify the model ID is correct
  2. Check if the deployment is still active
  3. Ensure you’re using the correct endpoint URL
  4. Check the Deploy dashboard for the correct deployment ID

”429 Rate Limited” Error

Symptoms:
  • API requests fail with rate limit errors
  • Retry-After header in response
Solutions:
  1. Implement exponential backoff in your client
  2. Check the Retry-After header for wait time
  3. Review your usage patterns
  4. Consider upgrading your plan for higher limits
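Steps 1 and 2 can be combined in a small helper: honor the server's `Retry-After` value when it is present, and otherwise fall back to capped exponential backoff with jitter. This is a generic client-side sketch, not part of an LLMTune SDK.

```python
import random

def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based).

    Honors the server's Retry-After header when provided; otherwise
    uses capped exponential backoff with jitter.
    """
    if retry_after is not None:
        return float(retry_after)
    delay = min(cap, base * (2 ** attempt))
    # Jitter spreads out retries so many clients don't hammer the
    # endpoint at the same instant.
    return delay * random.uniform(0.5, 1.0)
```

Sleep for `backoff_delay(attempt, retry_after)` between attempts whenever a 429 (or 503) comes back; the same helper works for both status codes.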

Slow Response Times

Symptoms:
  • API calls take longer than expected
  • High latency in responses
Solutions:
  1. Check deployment runtime observability metrics
  2. Review model size - larger models have higher latency
  3. Consider using a smaller, faster model
  4. Implement caching for repeated requests
  5. Check if the endpoint is scaling properly

Unexpected Model Outputs

Symptoms:
  • Model returns incorrect or unexpected responses
  • Outputs don’t match fine-tuning data
Solutions:
  1. Verify the correct deployment is being used
  2. Check if the model was fine-tuned with appropriate data
  3. Review evaluation results to understand model behavior
  4. Consider retraining with improved dataset
  5. Adjust inference parameters (temperature, top_p)
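When adjusting inference parameters (step 5), lower `temperature` and `top_p` make outputs more deterministic and usually closer to the fine-tuning data. The request body below assumes an OpenAI-style chat schema; the model ID and message content are placeholders.

```python
import json

# Illustrative request body; assumes the endpoint accepts OpenAI-style
# chat parameters. "YOUR_DEPLOYMENT_ID" is a placeholder.
payload = {
    "model": "YOUR_DEPLOYMENT_ID",
    "messages": [{"role": "user", "content": "Summarize our refund policy."}],
    "temperature": 0.2,  # lower = more deterministic output
    "top_p": 0.9,        # restricts sampling to the top probability mass
}
body = json.dumps(payload)
```

Start by lowering `temperature` alone; changing both parameters at once makes it harder to tell which one fixed (or caused) the behavior.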

Dataset Issues

Dataset Upload Fails

Symptoms:
  • Upload gets stuck or fails
  • “Invalid file format” error
Solutions:
  1. Ensure file format is supported (JSONL, CSV, TXT)
  2. Check file size limits
  3. Verify file encoding is UTF-8
  4. For JSONL, ensure each line is valid JSON
  5. Try uploading a smaller sample file first
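Steps 3 and 4 can be checked locally before uploading. The validator below is a generic sketch: it assumes UTF-8 encoding and that each record needs a top-level `messages` or `conversations` key (adjust `required` to your training method's schema).

```python
import json

def validate_jsonl(path, required=("messages", "conversations")):
    """Report (line number, problem) pairs for a JSONL file.

    Flags non-UTF-8 content (open() raises), empty lines, invalid JSON,
    and records missing all of the `required` top-level keys.
    """
    problems = []
    with open(path, encoding="utf-8") as f:  # raises on non-UTF-8 bytes
        for lineno, line in enumerate(f, 1):
            if not line.strip():
                problems.append((lineno, "empty line"))
                continue
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append((lineno, f"invalid JSON: {e}"))
                continue
            if not any(k in obj for k in required):
                problems.append((lineno, "missing messages/conversations field"))
    return problems
```

Running this on a small sample file first mirrors step 5: if the sample passes locally but the upload still fails, the issue is more likely size limits or encoding than record structure.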

Dataset Validation Errors

Symptoms:
  • Dataset shows validation errors in your data library
  • “Missing required fields” error
Solutions:
  1. Review the validation error details
  2. Check that JSONL has required fields (messages or conversations)
  3. Verify all JSON objects are properly formatted
  4. Check for special characters that need escaping
  5. Use the playground dataset format as reference

PII Detection Issues

Symptoms:
  • Too many or too few PII detections
  • False positives/negatives
Solutions:
  1. Review the PII detection report
  2. Adjust PII detection sensitivity if available
  3. Manually review and flag false positives
  4. Ensure data doesn’t contain actual PII if not intended
  5. Contact support if detection seems incorrect

Low Quality Score

Symptoms:
  • Dataset receives low quality score
  • Warnings about data quality
Solutions:
  1. Review quality report details
  2. Check for duplicate entries
  3. Ensure consistent formatting across entries
  4. Verify conversation turns are properly labeled
  5. Add more diverse examples if coverage is low
  6. Clean data before upload (remove empty entries, fix formatting)
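Steps 2 and 6 (duplicates and empty entries) are easy to automate before upload. The helper below is a simple pre-upload pass, not an LLMTune tool; your quality report may flag other issues (imbalanced turns, inconsistent formatting) that still need manual review.

```python
import json

def clean_records(lines):
    """Drop exact duplicates and empty entries from JSONL lines.

    Records are normalized with sort_keys so that key order doesn't
    hide duplicates.
    """
    seen, cleaned = set(), []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # empty entry
        key = json.dumps(json.loads(line), sort_keys=True)
        if key in seen:
            continue  # exact duplicate
        seen.add(key)
        cleaned.append(line)
    return cleaned
```

Note this only catches exact duplicates; near-duplicates (same conversation with trivial wording changes) can still drag the quality score down and need a fuzzier check.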

Evaluation Issues

Evaluation Fails to Run

Symptoms:
  • Evaluation button doesn’t respond
  • “Evaluation failed” error
Solutions:
  1. Ensure the training job completed successfully
  2. Check if the model is accessible
  3. Verify the deployment is active and healthy
  4. Try a simpler evaluation first (single prompt)
  5. Check browser console for errors

Comparison Not Working

Symptoms:
  • Can’t compare with base model
  • Base model unavailable
Solutions:
  1. Ensure the base model is available in the system
  2. Check if the base model is compatible with your fine-tuned model
  3. Try using the playground to test the base model
  4. Contact support if base model should be available

Batch Evaluation Slow

Symptoms:
  • Batch evaluation takes a long time
  • Progress seems stuck
Solutions:
  1. Note that batch evaluation processes prompts sequentially, so large batches take proportionally longer
  2. Check the number of prompts in your batch
  3. Reduce batch size for faster results
  4. Ensure deployment can handle the load
  5. Monitor progress in the evaluation interface

Unexpected Evaluation Results

Symptoms:
  • Metrics don’t match expectations
  • Quality scores seem incorrect
Solutions:
  1. Verify evaluation prompts are appropriate
  2. Check if inference parameters are suitable
  3. Review sample outputs to understand behavior
  4. Compare with base model for context
  5. Consider adjusting evaluation criteria

Account & Billing Issues

Can’t Create API Key

Symptoms:
  • API key creation fails
  • “Limit reached” error
Solutions:
  1. Check how many keys you’ve created
  2. Delete unused keys if limit reached
  3. Contact support for key limit increase
  4. Ensure you have admin permissions

Billing Questions

Symptoms:
  • Unexpected charges
  • Need to understand billing
Solutions:
  1. Review Usage dashboard for breakdown
  2. Check billing reports in the dashboard
  3. Understand training vs inference costs
  4. Contact support for billing disputes

Can’t Add Credits

Symptoms:
  • Stripe integration fails
  • Payment processing error
Solutions:
  1. Check if Stripe is enabled in your region
  2. Verify payment method details
  3. Try a different payment method
  4. Contact support if issue persists

Workspace Access Issues

Symptoms:
  • Can’t access workspace
  • “Not authorized” error
Solutions:
  1. Verify you’re logged into the correct account
  2. Check if you’ve been invited to the workspace
  3. Contact workspace admin for access
  4. Ensure your account is verified

Getting Help

If you can’t resolve your issue with the solutions above, here are additional ways to get help:

Contact Support

  • Email: support@llmtune.io
  • In-app support: Use the support panel in the dashboard
  • Response time: Typically within 24 hours for paid plans

When Contacting Support

Include the following information to help us resolve your issue faster:
  1. Workspace ID - Found in workspace settings
  2. Job/Deployment ID - For training or deployment issues
  3. Error message - Copy the exact error text
  4. Timestamps - When did the issue occur?
  5. Steps to reproduce - What were you doing when it happened?
  6. Browser/Client info - Browser version or SDK version

Self-Service Resources

Status Page

Check the LLMTune Status Page for:
  • Current system status
  • Scheduled maintenance
  • Known incidents and outages

Escalation

For urgent production issues affecting your business:
  1. Mark your support ticket as “Urgent”
  2. Include business impact details
  3. Enterprise customers have access to priority support

Before contacting support, try reproducing the issue in a different browser or environment. This helps identify if the issue is specific to your setup.