Datasets

Datasets are the foundation of every fine-tune. LLMTune’s Dataset Hub supports multiple data sources, quality scoring, PII detection, and automatic cleaning.

Dataset Hub Overview

Dataset Hub provides:

Multiple data sources – Upload files, connect HuggingFace Hub, link cloud storage (S3, GCS, or Azure Blob), or pull from an HTTPS URL
Quality scoring – Automatic quality metrics and validation
PII detection – Detect and mask personally identifiable information
Automatic cleaning – Clean and prepare data automatically
Version control – Track and manage dataset versions

Upload Workflow

Direct Upload

1. Navigate to Dataset Hub from the main navigation.
2. Click Upload Dataset.
3. Drag and drop files, or browse to select them. Supported formats:
   - JSONL – Preferred format, with a messages or conversations array per line
   - CSV – Column-based data (LLMTune will prompt you to map columns)
   - TXT – Plain text files
4. Choose a dataset name and, optionally, a description.
5. LLMTune then runs automatic validation:
   - Schema detection
   - Format validation
   - Quality scoring
   - PII detection
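
If you would rather upload JSONL than map CSV columns in the UI, the conversion can be done locally first. The following is a minimal sketch, not part of LLMTune: the function `csv_to_jsonl` and the column names `question` and `answer` are hypothetical and stand in for whatever columns your CSV actually has.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text, user_col, assistant_col):
    """Convert CSV rows into JSONL lines that use the `messages` layout."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        record = {
            "messages": [
                {"role": "user", "content": row[user_col].strip()},
                {"role": "assistant", "content": row[assistant_col].strip()},
            ]
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Hypothetical two-column CSV; replace with the contents of your own file.
sample_csv = "question,answer\nWhat is AI?,AI is...\nExplain ML,ML is...\n"
jsonl_text = csv_to_jsonl(sample_csv, user_col="question", assistant_col="answer")
print(jsonl_text)
```

Each output line is one JSON object, which matches the JSONL layout described under Dataset Formats.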

External Sources

1. Click Connect Source in Dataset Hub.
2. Choose from:
   - HuggingFace Hub – Connect datasets from HuggingFace
   - Cloud Storage – Connect from S3, GCS, or Azure Blob
   - External URLs – Connect to any HTTPS endpoint
3. Configure connection settings and authentication.
4. LLMTune syncs and validates the data.

Dataset Formats

JSONL Format (Recommended)

For most training methods, use JSONL with conversation-style data:

{"messages": [{"role": "user", "content": "What is AI?"}, {"role": "assistant", "content": "AI is..."}]}
{"messages": [{"role": "user", "content": "Explain ML"}, {"role": "assistant", "content": "ML is..."}]}

Or with conversations format:

{"conversations": [{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hi!"}]}
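
Before uploading, it can help to check locally that every line parses and has the expected shape. The sketch below is an illustration only and is not LLMTune's validator: the function `validate_jsonl` is hypothetical, and the rule that each turn needs string `role` and `content` fields is an assumption drawn from the examples above.

```python
import json

ALLOWED_KEYS = ("messages", "conversations")


def validate_jsonl(text):
    """Return (line_number, problem) tuples; an empty list means no problems found."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        key = next((k for k in ALLOWED_KEYS if k in record), None)
        if key is None:
            problems.append((number, "missing 'messages' or 'conversations'"))
            continue
        turns = record[key]
        if not isinstance(turns, list) or not turns:
            problems.append((number, f"'{key}' must be a non-empty list"))
            continue
        for turn in turns:
            ok = (
                isinstance(turn, dict)
                and isinstance(turn.get("role"), str)
                and isinstance(turn.get("content"), str)
            )
            if not ok:
                problems.append((number, "each turn needs string 'role' and 'content'"))
                break
    return problems


sample = "\n".join([
    '{"messages": [{"role": "user", "content": "What is AI?"}, {"role": "assistant", "content": "AI is..."}]}',
    '{"text": "no conversation here"}',
    '{"messages": [',
])
for line_number, problem in validate_jsonl(sample):
    print(f"line {line_number}: {problem}")
```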

Specialized Formats

Multimodal – Include image URLs or base64-encoded images
Audio methods – Include audio file paths with transcripts
Code generation – Code examples with natural language prompts

Quality Scoring

Dataset Hub automatically scores datasets on:

Data quality metrics – Completeness, consistency, and format validity
Token distribution – Token counts per record and how they are distributed
Label balance – Class distribution, for classification tasks
Coverage – Conversation depth and variety

Review quality reports in Dataset Hub to identify issues before training.
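
To get a rough feel for token distribution and conversation depth before uploading, you can compute similar statistics yourself. This is a sketch under stated assumptions, not the scoring LLMTune uses: `dataset_stats` is a hypothetical helper, and whitespace-separated words are used as a crude stand-in for real tokenizer counts.

```python
import json
import statistics


def dataset_stats(jsonl_text):
    """Summarise approximate token counts and turn counts per record."""
    token_counts = []
    turn_counts = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        turns = record.get("messages") or record.get("conversations") or []
        turn_counts.append(len(turns))
        # Whitespace splitting only approximates a model tokenizer.
        token_counts.append(sum(len(t["content"].split()) for t in turns))
    if not token_counts:
        return {"records": 0}
    return {
        "records": len(token_counts),
        "mean_tokens": statistics.mean(token_counts),
        "max_tokens": max(token_counts),
        "mean_turns": statistics.mean(turn_counts),
    }


sample = "\n".join([
    '{"messages": [{"role": "user", "content": "What is AI?"}, {"role": "assistant", "content": "AI is..."}]}',
    '{"messages": [{"role": "user", "content": "Explain ML"}, {"role": "assistant", "content": "ML is..."}]}',
])
stats = dataset_stats(sample)
print(stats)
```

Records with token counts far above the mean are worth inspecting by hand, since they may exceed the context length of the model you plan to fine-tune.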

PII Detection and Masking

Dataset Hub automatically:

Detects PII – Identifies emails, phone numbers, credit card numbers, SSNs, and similar identifiers
Masks sensitive data – Replaces PII with placeholders before training
Generates reports – Shows what was detected and masked

This helps you meet privacy requirements and keeps sensitive information out of training data. Review the report before training, since automated detection can miss some cases.
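
The detect-mask-report idea can be illustrated with a few regular expressions. This is a simplified sketch and not how Dataset Hub detects PII: the patterns, the placeholder names, and the `mask_pii` function are assumptions, and the phone and SSN patterns only cover common US formats.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def mask_pii(text):
    """Replace matches with [LABEL] placeholders and count what was masked."""
    counts = {}
    for label, pattern in PII_PATTERNS.items():
        text, replaced = pattern.subn(f"[{label}]", text)
        if replaced:
            counts[label] = replaced
    return text, counts


masked, report = mask_pii("Contact jane@example.com or 555-123-4567, SSN 123-45-6789.")
print(masked)
print(report)
```

The returned counts play the role of a small report, so you can see how much was masked per category.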

Versioning and Tags

Automatic versioning – Each upload becomes a new version
Rollback support – Revert to previous versions if needed
Tags – Add tags (e.g., priority:high, channel:support) to filter subsets
Metadata – Store notes, descriptions, and custom metadata
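
As an illustration of how key:value tags can select a subset, here is a small in-memory sketch. It assumes tags are stored as a list of strings on each item; that layout, and the helpers `parse_tags` and `filter_by_tags`, are hypothetical and do not describe how Dataset Hub stores tags.

```python
def parse_tags(tags):
    """Turn ["priority:high", "channel:support"] into a key/value dict."""
    parsed = {}
    for tag in tags:
        key, _, value = tag.partition(":")
        parsed[key.strip()] = value.strip()
    return parsed


def filter_by_tags(records, **wanted):
    """Keep records whose tags match every requested key/value pair."""
    selected = []
    for record in records:
        tags = parse_tags(record.get("tags", []))
        if all(tags.get(key) == value for key, value in wanted.items()):
            selected.append(record)
    return selected


# Hypothetical items carrying the example tags from the text above.
records = [
    {"id": 1, "tags": ["priority:high", "channel:support"]},
    {"id": 2, "tags": ["priority:low", "channel:support"]},
    {"id": 3, "tags": ["priority:high", "channel:sales"]},
]
urgent_support = filter_by_tags(records, priority="high", channel="support")
print([r["id"] for r in urgent_support])
```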

Blending Sources

During fine-tuning in FineTune Studio, you can blend multiple datasets:

Assign weights (e.g., support_chat:0.7, policy_docs:0.3)
Preview the combined distribution
Mix datasets from different sources (uploaded + HuggingFace + cloud storage)
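
One way to think about weights is as sampling probabilities: with support_chat:0.7 and policy_docs:0.3, roughly 70% of the blended examples come from the first dataset. The sketch below illustrates that interpretation with weighted random sampling. It is an assumption about what weighting means, not a description of FineTune Studio's blending logic, and the `blend` function and sample data are hypothetical.

```python
import random
from collections import Counter


def blend(datasets, weights, total, seed=0):
    """Draw `total` examples, picking the source dataset according to its weight."""
    rng = random.Random(seed)  # fixed seed so the preview is reproducible
    names = list(weights)
    picks = rng.choices(names, weights=[weights[n] for n in names], k=total)
    # Examples are sampled with replacement from each chosen source.
    return [(name, rng.choice(datasets[name])) for name in picks]


# Hypothetical datasets standing in for real uploads.
datasets = {
    "support_chat": [f"chat example {i}" for i in range(50)],
    "policy_docs": [f"policy example {i}" for i in range(50)],
}
weights = {"support_chat": 0.7, "policy_docs": 0.3}

blended = blend(datasets, weights, total=1000)
distribution = Counter(name for name, _ in blended)
print(distribution)
```

Counting the sources in the blended sample gives a quick preview of the combined distribution before any training run.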

Playground Datasets

FineTune Studio includes pre-configured playground datasets for each training method:

Automatically selected based on your chosen training method
Pre-validated and ready to use
Perfect for quick testing and experimentation
Can be used as-is or combined with your own datasets

Best Practices

Start with playground datasets – Use them to test training methods quickly
Validate before training – Review quality scores and PII detection reports
Use versioning – Keep track of dataset changes and roll back if needed
Tag datasets – Use tags to organize and filter datasets
Check format compatibility – Ensure your dataset format matches your training method

Next Steps

Learn about Fine-Tuning to use datasets in training
Read about Model Configuration to understand model selection
Check the Fine-Tuning Guide for dataset format requirements by training method
