Evaluate
The LLMTune evaluation interface provides comprehensive tools to test and compare your fine-tuned models. Use it to verify model performance, compare against base models, and track evaluation metrics over time.
Features
Single Prompt Evaluation
- Test individual prompts with your fine-tuned model
- Adjust inference parameters (max tokens, temperature, top P, top K)
- View full output with input context
- Save evaluation history for future reference
- Copy and share results
Comparison Evaluation
- Compare fine-tuned model against the base model
- Side-by-side output comparison
- Performance metrics (latency, output length)
- Visual charts showing differences
- Key differences summary
Batch Evaluation
- Evaluate multiple prompts at once (one per line)
- Process prompts sequentially with progress tracking
- Summary statistics (success rate, average latency, average output length)
- Export results as CSV
- Automatic history saving for successful evaluations
Results Dashboard
- Comprehensive overview of all evaluations
- Metrics summary cards (total evaluations, single prompts, comparisons)
- Output length trend chart
- Comparison metrics bar chart
- Detailed results table with timestamps
- Export all results functionality
Evaluation Presets
Quick configuration presets for different use cases:
- Quick Test: Fast evaluation with short responses (50 tokens, temp 0.7)
- Detailed Response: Longer, comprehensive answers (500 tokens, temp 0.3)
- Creative: More creative and diverse outputs (200 tokens, temp 1.0)
- Precise: Focused and accurate responses (100 tokens, temp 0.1)
- Balanced: Good balance between creativity and accuracy (128 tokens, temp 0.7)
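The presets above are simply bundles of the inference parameters described under Evaluation Parameters below. If you want to reuse them in your own scripts, a minimal sketch is shown here; only the token counts and temperatures come from the list above, and the dictionary name and any other defaults are assumptions.

```python
# Sketch of the evaluation presets as plain parameter dictionaries.
# Only max_tokens and temperature are taken from the preset descriptions
# above; the dictionary name and structure are illustrative assumptions.
EVAL_PRESETS = {
    "quick_test":        {"max_tokens": 50,  "temperature": 0.7},
    "detailed_response": {"max_tokens": 500, "temperature": 0.3},
    "creative":          {"max_tokens": 200, "temperature": 1.0},
    "precise":           {"max_tokens": 100, "temperature": 0.1},
    "balanced":          {"max_tokens": 128, "temperature": 0.7},
}
```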
Prompt Templates Library
Pre-built prompt templates organized by category:
- General Knowledge: Questions about AI, ML, and technology
- Code Generation: Programming tasks and algorithms
- Summarization: Text summarization prompts
- Reasoning: Logical reasoning and problem-solving
- Creative Writing: Creative and narrative prompts
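If you script evaluations outside the UI, the same categories can be kept in a plain mapping. The example prompts below are illustrative placeholders, not the actual entries in the built-in library.

```python
# Illustrative prompt templates grouped by the categories listed above.
# The example prompts are placeholders, not the library's actual entries.
PROMPT_TEMPLATES = {
    "general_knowledge": ["Explain the difference between supervised and unsupervised learning."],
    "code_generation":   ["Write a Python function that returns the n-th Fibonacci number."],
    "summarization":     ["Summarize the following paragraph in two sentences: ..."],
    "reasoning":         ["If all A are B and some B are C, can we conclude that some A are C? Explain."],
    "creative_writing":  ["Write a short opening paragraph for a story set on a generation ship."],
}
```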
Workflow
Single Prompt Evaluation
- Open a completed training job
- Click Evaluate to open the evaluation interface
- Select Single Prompt mode
- Choose an evaluation preset or configure parameters manually
- Optionally select a prompt from the templates library
- Enter your test prompt
- Click Evaluate and review the output
- Use Copy or Share to export results
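For scripted checks outside the UI, the sketch below shows the general shape of a single-prompt evaluation and of a saved history record. The `generate` callable is a placeholder for whatever inference entry point your setup exposes, and the JSONL history format is an assumption, not LLMTune's internal storage format.

```python
import json
import time
from pathlib import Path

def run_single_eval(generate, prompt, max_tokens=128, temperature=0.7,
                    history_path="eval_history.jsonl"):
    """Run one prompt through `generate` and append the result to a history file.

    `generate(prompt, max_tokens=..., temperature=...) -> str` is a placeholder
    for your own inference call (local server, SDK, etc.).
    """
    start = time.perf_counter()
    output = generate(prompt, max_tokens=max_tokens, temperature=temperature)
    latency_s = time.perf_counter() - start

    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "prompt": prompt,
        "output": output,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "latency_s": round(latency_s, 3),
        "output_length": len(output.split()),
    }
    with Path(history_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```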
Comparison Evaluation
- Open the evaluation interface
- Select Compare with Base Model mode
- Enter your test prompt
- Click Evaluate
- Review side-by-side comparison:
  - Fine-tuned model output
  - Base model output
  - Performance metrics
  - Key differences summary
- Export comparison results if needed
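A scripted equivalent of the comparison view might look like the sketch below. Both `generate_finetuned` and `generate_base` are placeholders for your own inference calls; the line-level diff is a stand-in for the key differences summary, which the UI computes in its own way.

```python
import time
from difflib import unified_diff

def compare_models(generate_finetuned, generate_base, prompt, **params):
    """Run the same prompt through two generate callables and summarize differences.

    Both callables are placeholders for your own inference calls; the metrics
    mirror what the comparison view reports (latency, output length).
    """
    results = {}
    for name, generate in (("finetuned", generate_finetuned), ("base", generate_base)):
        start = time.perf_counter()
        output = generate(prompt, **params)
        results[name] = {
            "output": output,
            "latency_s": round(time.perf_counter() - start, 3),
            "output_length": len(output.split()),
        }

    # A crude "key differences" summary: line-level diff of the two outputs.
    diff = unified_diff(
        results["base"]["output"].splitlines(),
        results["finetuned"]["output"].splitlines(),
        fromfile="base", tofile="finetuned", lineterm="",
    )
    results["diff"] = "\n".join(diff)
    return results
```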
Batch Evaluation
- Select Batch Evaluation mode
- Enter multiple prompts (one per line) or use templates
- Configure evaluation parameters
- Click Evaluate Batch
- Monitor progress as prompts are processed
- Review summary statistics:
  - Success rate
  - Average latency
  - Average output length
- Export results as CSV
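The sketch below mirrors what batch evaluation does conceptually: run prompts sequentially, record per-prompt metrics, compute summary statistics, and write a CSV. The `generate` callable and the CSV column names are assumptions, not LLMTune's actual export schema.

```python
import csv
import time

def run_batch_eval(generate, prompts, csv_path="batch_results.csv", **params):
    """Evaluate prompts one at a time and export per-prompt results as CSV.

    `generate(prompt, **params) -> str` is a placeholder for your own inference
    call; the CSV columns are an assumption, not LLMTune's export format.
    """
    prompts = list(prompts)
    rows = []
    for i, prompt in enumerate(prompts, 1):
        start = time.perf_counter()
        try:
            output, ok = generate(prompt, **params), True
        except Exception as exc:          # record the failure, keep processing
            output, ok = f"ERROR: {exc}", False
        rows.append({
            "index": i,
            "prompt": prompt,
            "output": output,
            "success": ok,
            "latency_s": round(time.perf_counter() - start, 3),
            "output_length": len(output.split()) if ok else 0,
        })
        print(f"[{i}/{len(prompts)}] {'ok' if ok else 'failed'}")

    if rows:
        with open(csv_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)

    successes = [r for r in rows if r["success"]]
    n = len(successes)
    return {
        "success_rate": n / len(rows) if rows else 0.0,
        "avg_latency_s": sum(r["latency_s"] for r in successes) / n if n else 0.0,
        "avg_output_length": sum(r["output_length"] for r in successes) / n if n else 0.0,
    }
```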
Results Dashboard
- Select Results Dashboard mode
- View comprehensive metrics:
  - Total evaluations count
  - Single prompts count
  - Comparisons count
  - Average output length
- Analyze trends:
  - Output length over time chart
  - Comparison metrics bar chart
- Review detailed results table
- Export all results if needed
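If you want to recompute dashboard-style metrics from an exported results file, a minimal sketch follows. The column names assume the batch export sketch above; adjust them to match the columns in your actual export.

```python
import csv
from statistics import mean

def summarize_results(csv_path):
    """Recompute dashboard-style metrics from an exported results CSV.

    Column names ('success', 'latency_s', 'output_length') follow the batch
    export sketch above; they are assumptions, not a fixed LLMTune schema.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    successes = [r for r in rows if r.get("success", "").lower() == "true"]
    return {
        "total_evaluations": len(rows),
        "success_rate": len(successes) / len(rows) if rows else 0.0,
        "avg_latency_s": mean(float(r["latency_s"]) for r in successes) if successes else 0.0,
        "avg_output_length": mean(float(r["output_length"]) for r in successes) if successes else 0.0,
    }
```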
Evaluation Parameters
| Parameter | Description | Range |
|---|---|---|
| Max Tokens | Maximum number of tokens to generate | 1-512 |
| Temperature | Sampling temperature (higher = more creative) | 0.0-2.0 |
| Top P | Nucleus sampling threshold | 0.0-1.0 |
| Top K | Top-K sampling limit | 1-100 |
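To make these parameters concrete, the sketch below shows how temperature, Top K, and Top P typically interact when sampling a single next token from a logits vector. It is a generic illustration of the sampling technique, not the sampler LLMTune uses internally.

```python
import numpy as np

def sample_token(logits, temperature=0.7, top_p=0.9, top_k=50, rng=None):
    """Illustrative next-token sampling using the parameters in the table above."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)

    # Temperature: values below 1.0 sharpen the distribution, above 1.0 flatten it.
    scaled = logits / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Top-K: keep only the K most probable tokens.
    order = np.argsort(probs)[::-1]
    keep = order[:top_k]

    # Top-P (nucleus): within the kept tokens, trim to the smallest prefix
    # whose cumulative probability reaches top_p.
    cum = np.cumsum(probs[keep])
    cutoff = np.searchsorted(cum, top_p) + 1
    keep = keep[:cutoff]

    # Renormalize and draw one token index from the surviving candidates.
    kept_probs = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept_probs))
```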
Best Practices
- Start with presets: Use evaluation presets to quickly test different scenarios
- Use templates: Leverage prompt templates for consistent testing
- Batch testing: Use batch evaluation for comprehensive model validation
- Track history: Review evaluation history to identify patterns
- Compare regularly: Compare fine-tuned models against base models to measure improvement
- Export results: Export evaluation results for documentation and analysis
Troubleshooting
- Empty outputs: Increase max tokens or adjust temperature
- Evaluation fails: Check that the training job completed successfully
- Base model unavailable: Ensure the base model is accessible in the executor
- Slow batch evaluation: Batch evaluation processes sequentially; large batches may take time