Datasets
Datasets are the foundation of every fine-tune. LLMTune’s Dataset Hub supports multiple data sources, quality scoring, PII detection, and automatic cleaning.Dataset Hub Overview
Dataset Hub provides:- Multiple data sources – Upload files, connect HuggingFace Hub, or link cloud storage (S3, GCS)
- Quality scoring – Automatic quality metrics and validation
- PII detection – Detect and mask personally identifiable information
- Automatic cleaning – Clean and prepare data automatically
- Version control – Track and manage dataset versions
Upload Workflow
Direct Upload
- Navigate to Dataset Hub from the main navigation.
- Click Upload Dataset.
- Drag-and-drop files or browse to select:
- JSONL – Preferred format with
messagesorconversationsarrays - CSV – Column-based data (LLMTune will prompt you to map columns)
- TXT – Plain text files
- JSONL – Preferred format with
- Choose a dataset name and optional description.
- LLMTune runs automatic validation:
- Schema detection
- Format validation
- Quality scoring
- PII detection
External Sources
- Click Connect Source in Dataset Hub.
- Choose from:
- HuggingFace Hub – Connect datasets from HuggingFace
- Cloud Storage – Connect from S3, GCS, or Azure Blob
- External URLs – Connect to any HTTPS endpoint
- Configure connection settings and authentication.
- LLMTune syncs and validates the data.
Dataset Formats
JSONL Format (Recommended)
For most training methods, use JSONL with conversation-style data:Specialized Formats
- Multimodal – Include image URLs or base64-encoded images
- Audio methods – Include audio file paths with transcripts
- Code generation – Code examples with natural language prompts
Quality Scoring
Dataset Hub automatically scores datasets on:- Data quality metrics – Completeness, consistency, format validity
- Token distribution – Analysis of token counts and distributions
- Label balance – For classification tasks, check class distribution
- Coverage – Check conversation depth and variety
PII Detection and Masking
Dataset Hub automatically:- Detects PII – Identifies emails, phone numbers, credit cards, SSNs, etc.
- Masks sensitive data – Replaces PII with placeholders before training
- Generates reports – Shows what was detected and masked
Versioning and Tags
- Automatic versioning – Each upload becomes a new version
- Rollback support – Revert to previous versions if needed
- Tags – Add tags (e.g.,
priority:high,channel:support) to filter subsets - Metadata – Store notes, descriptions, and custom metadata
Blending Sources
During fine-tuning in FineTune Studio, you can blend multiple datasets:- Assign weights (e.g.,
support_chat:0.7,policy_docs:0.3) - Preview the combined distribution
- Mix datasets from different sources (uploaded + HuggingFace + cloud storage)
Playground Datasets
FineTune Studio includes pre-configured playground datasets for each training method:- Automatically selected based on your chosen training method
- Pre-validated and ready to use
- Perfect for quick testing and experimentation
- Can be used as-is or combined with your own datasets
Best Practices
- Start with playground datasets – Use them to test training methods quickly
- Validate before training – Review quality scores and PII detection reports
- Use versioning – Keep track of dataset changes and rollback if needed
- Tag datasets – Use tags to organize and filter datasets
- Check format compatibility – Ensure your dataset format matches your training method
Next Steps
- Learn about Fine-Tuning to use datasets in training
- Read about Model Configuration to understand model selection
- Check the Fine-Tuning Guide for dataset format requirements by training method