Evaluate

The LLMTune evaluation interface provides comprehensive tools to test and compare your fine-tuned models. Use it to verify model performance, compare against base models, and track evaluation metrics over time.

Features

Single Prompt Evaluation

  • Test individual prompts with your fine-tuned model
  • Adjust inference parameters (max tokens, temperature, top P, top K); see the sketch after this list
  • View full output with input context
  • Save evaluation history for future reference
  • Copy and share results
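
LLMTune exposes these controls through the UI; no programmatic API is documented on this page. Purely as an illustration of what a single-prompt evaluation records, the sketch below drives a hypothetical generate() callable (a stand-in for your own inference backend) and appends the result to a local history file:

```python
import json
import time
from pathlib import Path

def generate(prompt: str, max_tokens: int, temperature: float,
             top_p: float, top_k: int) -> str:
    """Placeholder: wire this to your own inference backend."""
    raise NotImplementedError

def evaluate_single(prompt: str, params: dict,
                    history_path: str = "eval_history.json") -> dict:
    start = time.perf_counter()
    output = generate(prompt, **params)
    latency = time.perf_counter() - start

    record = {
        "mode": "single",
        "prompt": prompt,
        "output": output,
        "latency_s": round(latency, 3),
        "output_chars": len(output),
        "params": params,
    }

    # Keep a local history file so results can be reviewed later.
    path = Path(history_path)
    history = json.loads(path.read_text()) if path.exists() else []
    history.append(record)
    path.write_text(json.dumps(history, indent=2))
    return record

# Example:
# evaluate_single("Explain LoRA in two sentences.",
#                 {"max_tokens": 128, "temperature": 0.7, "top_p": 0.9, "top_k": 40})
```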

Comparison Evaluation

  • Compare fine-tuned model against the base model
  • Side-by-side output comparison
  • Performance metrics (latency, output length), sketched below
  • Visual charts showing differences
  • Key differences summary
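
The comparison view reports the same metrics for both models. As a rough sketch of how those numbers relate, the function below runs one prompt through two hypothetical callables (stand-ins for the fine-tuned and base models) and computes the latency and output-length deltas that the summary highlights:

```python
import time

def compare(prompt: str, finetuned_generate, base_generate, **params) -> dict:
    """Run the same prompt through both models and collect simple metrics."""
    results = {}
    for name, fn in [("finetuned", finetuned_generate), ("base", base_generate)]:
        start = time.perf_counter()
        output = fn(prompt, **params)
        results[name] = {
            "output": output,
            "latency_s": round(time.perf_counter() - start, 3),
            "output_chars": len(output),
        }
    # Crude "key differences": deltas of the fine-tuned model relative to the base model.
    results["delta"] = {
        "latency_s": results["finetuned"]["latency_s"] - results["base"]["latency_s"],
        "output_chars": results["finetuned"]["output_chars"] - results["base"]["output_chars"],
    }
    return results
```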

Batch Evaluation

  • Evaluate multiple prompts at once (one per line); see the sketch after this list
  • Process prompts sequentially with progress tracking
  • Summary statistics (success rate, average latency, average output length)
  • Export results as CSV
  • Automatic history saving for successful evaluations
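
Batch mode works through the prompt list sequentially, which the sketch below mirrors; the generate callable, record fields, and CSV layout are illustrative assumptions rather than LLMTune's actual export format:

```python
import csv
import statistics
import time

def evaluate_batch(prompts: list[str], generate,
                   csv_path: str = "batch_results.csv", **params):
    """Process prompts one by one, then export per-prompt results as CSV."""
    if not prompts:
        raise ValueError("prompts must not be empty")

    rows = []
    for i, prompt in enumerate(prompts, start=1):
        print(f"[{i}/{len(prompts)}] evaluating...")  # simple progress tracking
        start = time.perf_counter()
        try:
            output = generate(prompt, **params)
            rows.append({"prompt": prompt, "success": True,
                         "latency_s": round(time.perf_counter() - start, 3),
                         "output_chars": len(output), "output": output})
        except Exception as exc:  # failed prompts still get a row
            rows.append({"prompt": prompt, "success": False,
                         "latency_s": round(time.perf_counter() - start, 3),
                         "output_chars": 0, "output": f"ERROR: {exc}"})

    ok = [r for r in rows if r["success"]]
    summary = {
        "success_rate": len(ok) / len(rows),
        "avg_latency_s": statistics.mean(r["latency_s"] for r in ok) if ok else 0.0,
        "avg_output_chars": statistics.mean(r["output_chars"] for r in ok) if ok else 0.0,
    }

    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
    return summary, rows
```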

Results Dashboard

  • Comprehensive overview of all evaluations
  • Metrics summary cards (total evaluations, single prompts, comparisons), reproduced in the sketch below
  • Output length trend chart
  • Comparison metrics bar chart
  • Detailed results table with timestamps
  • Export all results functionality
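
The same dashboard numbers can be reproduced from saved results. The sketch below aggregates a history file shaped like the one written in the single-prompt sketch above; the mode and output_chars fields are assumptions carried over from that sketch, not LLMTune's export schema:

```python
import json
from collections import Counter
from pathlib import Path
from statistics import mean

def dashboard_summary(history_path: str = "eval_history.json") -> dict:
    """Aggregate saved evaluation records into dashboard-style metrics."""
    records = json.loads(Path(history_path).read_text())
    modes = Counter(r.get("mode", "single") for r in records)
    return {
        "total_evaluations": len(records),
        "single_prompts": modes["single"],
        "comparisons": modes["comparison"],
        "avg_output_chars": mean(r["output_chars"] for r in records) if records else 0.0,
    }
```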

Evaluation Presets

Quick configuration presets for different use cases (see the sketch after this list):
  • Quick Test: Fast evaluation with short responses (50 tokens, temp 0.7)
  • Detailed Response: Longer, comprehensive answers (500 tokens, temp 0.3)
  • Creative: More creative and diverse outputs (200 tokens, temp 1.0)
  • Precise: Focused and accurate responses (100 tokens, temp 0.1)
  • Balanced: Good balance between creativity and accuracy (128 tokens, temp 0.7)
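
For reference, the documented presets translate to parameter settings like the following. The dictionary form and the top_p/top_k defaults in the usage line are illustrative assumptions; only the token counts and temperatures come from the list above:

```python
# Hypothetical dictionary form of the documented presets.
EVAL_PRESETS = {
    "quick_test":        {"max_tokens": 50,  "temperature": 0.7},
    "detailed_response": {"max_tokens": 500, "temperature": 0.3},
    "creative":          {"max_tokens": 200, "temperature": 1.0},
    "precise":           {"max_tokens": 100, "temperature": 0.1},
    "balanced":          {"max_tokens": 128, "temperature": 0.7},
}

# Example: merge a preset with assumed defaults for the remaining parameters.
params = {"top_p": 0.9, "top_k": 40, **EVAL_PRESETS["balanced"]}
```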

Prompt Templates Library

Pre-built prompt templates organized by category (example prompts sketched after this list):
  • General Knowledge: Questions about AI, ML, and technology
  • Code Generation: Programming tasks and algorithms
  • Summarization: Text summarization prompts
  • Reasoning: Logical reasoning and problem-solving
  • Creative Writing: Creative and narrative prompts
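
The prompts below are hypothetical stand-ins for the built-in templates, shown only to illustrate how a category-to-prompts mapping can feed evaluation (for batch mode, one prompt per line):

```python
# Hypothetical example prompts, one per category; the actual built-in templates differ.
PROMPT_TEMPLATES = {
    "general_knowledge": ["Explain the difference between supervised and unsupervised learning."],
    "code_generation":   ["Write a Python function that reverses a linked list."],
    "summarization":     ["Summarize the following paragraph in one sentence: ..."],
    "reasoning":         ["If all A are B and some B are C, does it follow that some A are C? Explain."],
    "creative_writing":  ["Write the opening paragraph of a short story set on a space station."],
}

# Example: flatten one category into the one-prompt-per-line format used by batch evaluation.
batch_input = "\n".join(PROMPT_TEMPLATES["reasoning"])
```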

Workflow

Single Prompt Evaluation

  1. Open a completed training job
  2. Click Evaluate to open the evaluation interface
  3. Select Single Prompt mode
  4. Choose an evaluation preset or configure parameters manually
  5. Optionally select a prompt from the templates library
  6. Enter your test prompt
  7. Click Evaluate and review the output
  8. Use Copy or Share to export results

Comparison Evaluation

  1. Open the evaluation interface
  2. Select Compare with Base Model mode
  3. Enter your test prompt
  4. Click Evaluate
  5. Review side-by-side comparison:
    • Fine-tuned model output
    • Base model output
    • Performance metrics
    • Key differences summary
  6. Export comparison results if needed

Batch Evaluation

  1. Select Batch Evaluation mode
  2. Enter multiple prompts (one per line) or use templates
  3. Configure evaluation parameters
  4. Click Evaluate Batch
  5. Monitor progress as prompts are processed
  6. Review summary statistics:
    • Success rate
    • Average latency
    • Average output length
  7. Export results as CSV

Results Dashboard

  1. Select Results Dashboard mode
  2. View comprehensive metrics:
    • Total evaluations count
    • Single prompts count
    • Comparisons count
    • Average output length
  3. Analyze trends:
    • Output length over time chart
    • Comparison metrics bar chart
  4. Review detailed results table
  5. Export all results if needed

Evaluation Parameters

| Parameter | Description | Range |
| --- | --- | --- |
| Max Tokens | Maximum number of tokens to generate | 1-512 |
| Temperature | Sampling temperature (higher = more creative) | 0.0-2.0 |
| Top P | Nucleus sampling threshold | 0.0-1.0 |
| Top K | Top-K sampling limit | 1-100 |
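
If you script evaluations outside the UI, it can help to mirror these ranges client-side. The helper below is a hypothetical validation sketch, not part of LLMTune:

```python
# Documented parameter ranges (see the table above).
PARAM_RANGES = {
    "max_tokens":  (1, 512),
    "temperature": (0.0, 2.0),
    "top_p":       (0.0, 1.0),
    "top_k":       (1, 100),
}

def clamp_params(params: dict) -> dict:
    """Clamp each known parameter to its allowed range."""
    clamped = dict(params)
    for name, (lo, hi) in PARAM_RANGES.items():
        if name in clamped:
            clamped[name] = min(max(clamped[name], lo), hi)
    return clamped

# Example: clamp_params({"max_tokens": 1024, "temperature": 0.7})
# -> {"max_tokens": 512, "temperature": 0.7}
```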

Best Practices

  1. Start with presets: Use evaluation presets to quickly test different scenarios
  2. Use templates: Leverage prompt templates for consistent testing
  3. Batch testing: Use batch evaluation for comprehensive model validation
  4. Track history: Review evaluation history to identify patterns
  5. Compare regularly: Compare fine-tuned models against base models to measure improvement
  6. Export results: Export evaluation results for documentation and analysis

Troubleshooting

  • Empty outputs: Increase max tokens or adjust temperature
  • Evaluation fails: Check that the training job completed successfully
  • Base model unavailable: Ensure the base model is accessible in the executor
  • Slow batch evaluation: Prompts are processed sequentially, so large batches may take time