Evaluate

The LLMTune evaluation interface provides comprehensive tools to test and compare your fine-tuned models. Use it to verify model performance, compare against base models, and track evaluation metrics over time.

Features

Single Prompt Evaluation

  • Test individual prompts with your fine-tuned model
  • Adjust inference parameters (max tokens, temperature, top P, top K); see the sketch after this list
  • View full output with input context
  • Save evaluation history for future reference
  • Copy and share results
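
LLMTune exposes these controls through the UI; no programmatic API is documented on this page. Purely as an illustration of what a single-prompt evaluation records, the sketch below drives a hypothetical generate() callable (a stand-in for your own inference backend) and appends the result to a local history file:

```python
import json
import time
from pathlib import Path

def generate(prompt: str, max_tokens: int, temperature: float,
             top_p: float, top_k: int) -> str:
    """Placeholder: wire this to your own inference backend."""
    raise NotImplementedError

def evaluate_single(prompt: str, params: dict,
                    history_path: str = "eval_history.json") -> dict:
    start = time.perf_counter()
    output = generate(prompt, **params)
    latency = time.perf_counter() - start

    record = {
        "mode": "single",
        "prompt": prompt,
        "output": output,
        "latency_s": round(latency, 3),
        "output_chars": len(output),
        "params": params,
    }

    # Keep a local history file so results can be reviewed later.
    path = Path(history_path)
    history = json.loads(path.read_text()) if path.exists() else []
    history.append(record)
    path.write_text(json.dumps(history, indent=2))
    return record

# Example:
# evaluate_single("Explain LoRA in two sentences.",
#                 {"max_tokens": 128, "temperature": 0.7, "top_p": 0.9, "top_k": 40})
```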

Comparison Evaluation

  • Compare fine-tuned model against the base model
  • Side-by-side output comparison
  • Performance metrics (latency, output length), sketched below
  • Visual charts showing differences
  • Key differences summary
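
The comparison view reports the same metrics for both models. As a rough sketch of how those numbers relate, the function below runs one prompt through two hypothetical callables (stand-ins for the fine-tuned and base models) and computes the latency and output-length deltas that the summary highlights:

```python
import time

def compare(prompt: str, finetuned_generate, base_generate, **params) -> dict:
    """Run the same prompt through both models and collect simple metrics."""
    results = {}
    for name, fn in [("finetuned", finetuned_generate), ("base", base_generate)]:
        start = time.perf_counter()
        output = fn(prompt, **params)
        results[name] = {
            "output": output,
            "latency_s": round(time.perf_counter() - start, 3),
            "output_chars": len(output),
        }
    # Crude "key differences": deltas of the fine-tuned model relative to the base model.
    results["delta"] = {
        "latency_s": results["finetuned"]["latency_s"] - results["base"]["latency_s"],
        "output_chars": results["finetuned"]["output_chars"] - results["base"]["output_chars"],
    }
    return results
```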

Batch Evaluation

  • Evaluate multiple prompts at once (one per line); see the sketch after this list
  • Process prompts sequentially with progress tracking
  • Summary statistics (success rate, average latency, average output length)
  • Export results as CSV
  • Automatic history saving for successful evaluations
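
Batch mode works through the prompt list sequentially, which the sketch below mirrors; the generate callable, record fields, and CSV layout are illustrative assumptions rather than LLMTune's actual export format:

```python
import csv
import statistics
import time

def evaluate_batch(prompts: list[str], generate,
                   csv_path: str = "batch_results.csv", **params):
    """Process prompts one by one, then export per-prompt results as CSV."""
    if not prompts:
        raise ValueError("prompts must not be empty")

    rows = []
    for i, prompt in enumerate(prompts, start=1):
        print(f"[{i}/{len(prompts)}] evaluating...")  # simple progress tracking
        start = time.perf_counter()
        try:
            output = generate(prompt, **params)
            rows.append({"prompt": prompt, "success": True,
                         "latency_s": round(time.perf_counter() - start, 3),
                         "output_chars": len(output), "output": output})
        except Exception as exc:  # failed prompts still get a row
            rows.append({"prompt": prompt, "success": False,
                         "latency_s": round(time.perf_counter() - start, 3),
                         "output_chars": 0, "output": f"ERROR: {exc}"})

    ok = [r for r in rows if r["success"]]
    summary = {
        "success_rate": len(ok) / len(rows),
        "avg_latency_s": statistics.mean(r["latency_s"] for r in ok) if ok else 0.0,
        "avg_output_chars": statistics.mean(r["output_chars"] for r in ok) if ok else 0.0,
    }

    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
    return summary, rows
```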

Results Dashboard

  • Comprehensive overview of all evaluations
  • Metrics summary cards (total evaluations, single prompts, comparisons), reproduced in the sketch below
  • Output length trend chart
  • Comparison metrics bar chart
  • Detailed results table with timestamps
  • Export all results functionality
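
The same dashboard numbers can be reproduced from saved results. The sketch below aggregates a history file shaped like the one written in the single-prompt sketch above; the mode and output_chars fields are assumptions carried over from that sketch, not LLMTune's export schema:

```python
import json
from collections import Counter
from pathlib import Path
from statistics import mean

def dashboard_summary(history_path: str = "eval_history.json") -> dict:
    """Aggregate saved evaluation records into dashboard-style metrics."""
    records = json.loads(Path(history_path).read_text())
    modes = Counter(r.get("mode", "single") for r in records)
    return {
        "total_evaluations": len(records),
        "single_prompts": modes["single"],
        "comparisons": modes["comparison"],
        "avg_output_chars": mean(r["output_chars"] for r in records) if records else 0.0,
    }
```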

Evaluation Presets

Quick configuration presets for different use cases (see the sketch after this list):
  • Quick Test: Fast evaluation with short responses (50 tokens, temp 0.7)
  • Detailed Response: Longer, comprehensive answers (500 tokens, temp 0.3)
  • Creative: More creative and diverse outputs (200 tokens, temp 1.0)
  • Precise: Focused and accurate responses (100 tokens, temp 0.1)
  • Balanced: Good balance between creativity and accuracy (128 tokens, temp 0.7)
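
For reference, the documented presets translate to parameter settings like the following. The dictionary form and the top_p/top_k defaults in the usage line are illustrative assumptions; only the token counts and temperatures come from the list above:

```python
# Hypothetical dictionary form of the documented presets.
EVAL_PRESETS = {
    "quick_test":        {"max_tokens": 50,  "temperature": 0.7},
    "detailed_response": {"max_tokens": 500, "temperature": 0.3},
    "creative":          {"max_tokens": 200, "temperature": 1.0},
    "precise":           {"max_tokens": 100, "temperature": 0.1},
    "balanced":          {"max_tokens": 128, "temperature": 0.7},
}

# Example: merge a preset with assumed defaults for the remaining parameters.
params = {"top_p": 0.9, "top_k": 40, **EVAL_PRESETS["balanced"]}
```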

Prompt Templates Library

Pre-built prompt templates organized by category (example prompts sketched after this list):
  • General Knowledge: Questions about AI, ML, and technology
  • Code Generation: Programming tasks and algorithms
  • Summarization: Text summarization prompts
  • Reasoning: Logical reasoning and problem-solving
  • Creative Writing: Creative and narrative prompts
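
The prompts below are hypothetical stand-ins for the built-in templates, shown only to illustrate how a category-to-prompts mapping can feed evaluation (for batch mode, one prompt per line):

```python
# Hypothetical example prompts, one per category; the actual built-in templates differ.
PROMPT_TEMPLATES = {
    "general_knowledge": ["Explain the difference between supervised and unsupervised learning."],
    "code_generation":   ["Write a Python function that reverses a linked list."],
    "summarization":     ["Summarize the following paragraph in one sentence: ..."],
    "reasoning":         ["If all A are B and some B are C, does it follow that some A are C? Explain."],
    "creative_writing":  ["Write the opening paragraph of a short story set on a space station."],
}

# Example: flatten one category into the one-prompt-per-line format used by batch evaluation.
batch_input = "\n".join(PROMPT_TEMPLATES["reasoning"])
```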

Workflow

Single Prompt Evaluation

  1. Open a completed training job
  2. Click Evaluate to open the evaluation interface
  3. Select Single Prompt mode
  4. Choose an evaluation preset or configure parameters manually
  5. Optionally select a prompt from the templates library
  6. Enter your test prompt
  7. Click Evaluate and review the output
  8. Use Copy or Share to export results

Comparison Evaluation

  1. Open the evaluation interface
  2. Select Compare with Base Model mode
  3. Enter your test prompt
  4. Click Evaluate
  5. Review side-by-side comparison:
    • Fine-tuned model output
    • Base model output
    • Performance metrics
    • Key differences summary
  6. Export comparison results if needed

Batch Evaluation

  1. Select Batch Evaluation mode
  2. Enter multiple prompts (one per line) or use templates
  3. Configure evaluation parameters
  4. Click Evaluate Batch
  5. Monitor progress as prompts are processed
  6. Review summary statistics:
    • Success rate
    • Average latency
    • Average output length
  7. Export results as CSV

Results Dashboard

  1. Select Results Dashboard mode
  2. View comprehensive metrics:
    • Total evaluations count
    • Single prompts count
    • Comparisons count
    • Average output length
  3. Analyze trends:
    • Output length over time chart
    • Comparison metrics bar chart
  4. Review detailed results table
  5. Export all results if needed

Evaluation Parameters

| Parameter | Description | Range |
| --- | --- | --- |
| Max Tokens | Maximum number of tokens to generate | 1-512 |
| Temperature | Sampling temperature (higher = more creative) | 0.0-2.0 |
| Top P | Nucleus sampling threshold | 0.0-1.0 |
| Top K | Top-K sampling limit | 1-100 |
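
If you script evaluations outside the UI, it can help to mirror these ranges client-side. The helper below is a hypothetical validation sketch, not part of LLMTune:

```python
# Documented parameter ranges (see the table above).
PARAM_RANGES = {
    "max_tokens":  (1, 512),
    "temperature": (0.0, 2.0),
    "top_p":       (0.0, 1.0),
    "top_k":       (1, 100),
}

def clamp_params(params: dict) -> dict:
    """Clamp each known parameter to its allowed range."""
    clamped = dict(params)
    for name, (lo, hi) in PARAM_RANGES.items():
        if name in clamped:
            clamped[name] = min(max(clamped[name], lo), hi)
    return clamped

# Example: clamp_params({"max_tokens": 1024, "temperature": 0.7})
# -> {"max_tokens": 512, "temperature": 0.7}
```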

Best Practices

  1. Start with presets: Use evaluation presets to quickly test different scenarios
  2. Use templates: Leverage prompt templates for consistent testing
  3. Batch testing: Use batch evaluation for comprehensive model validation
  4. Track history: Review evaluation history to identify patterns
  5. Compare regularly: Compare fine-tuned models against base models to measure improvement
  6. Export results: Export evaluation results for documentation and analysis

Troubleshooting

  • Empty outputs: Increase max tokens or adjust temperature
  • Evaluation fails: Check that the training job completed successfully
  • Base model unavailable: Ensure the base model is accessible in the executor
  • Slow batch evaluation: Prompts are processed sequentially, so large batches may take time