
Datasets

Datasets are the foundation of every fine-tune. LLMTune’s Dataset Hub supports multiple data sources, quality scoring, PII detection, and automatic cleaning.

Dataset Hub Overview

Dataset Hub provides:
  • Multiple data sources – Upload files, connect HuggingFace Hub, or link cloud storage (S3, GCS, or Azure Blob)
  • Quality scoring – Automatic quality metrics and validation
  • PII detection – Detect and mask personally identifiable information
  • Automatic cleaning – Clean and prepare data automatically
  • Version control – Track and manage dataset versions

Upload Workflow

Direct Upload

  1. Navigate to Dataset Hub from the main navigation.
  2. Click Upload Dataset.
  3. Drag-and-drop files or browse to select:
    • JSONL – Preferred format with messages or conversations arrays
    • CSV – Column-based data (LLMTune will prompt you to map columns)
    • TXT – Plain text files
  4. Choose a dataset name and optional description.
  5. LLMTune runs automatic validation:
    • Schema detection
    • Format validation
    • Quality scoring
    • PII detection
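Running a local pre-check before upload can surface problems early. The sketch below is an illustration only, not LLMTune's validator: the function names and the accepted role set (system, user, assistant) are assumptions, and it only checks that each JSONL line parses and contains a non-empty messages or conversations array with string content.

```python
import json

# Assumed role set for illustration; the roles LLMTune accepts may differ.
VALID_ROLES = {"system", "user", "assistant"}

def validate_jsonl_line(line: str) -> list[str]:
    """Return a list of problems found in one JSONL record (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    turns = record.get("messages", record.get("conversations"))
    if not isinstance(turns, list) or not turns:
        return ["missing or empty 'messages'/'conversations' array"]
    problems = []
    for i, turn in enumerate(turns):
        if not isinstance(turn, dict):
            problems.append(f"turn {i}: not an object")
            continue
        role = turn.get("role")
        if not isinstance(role, str) or role not in VALID_ROLES:
            problems.append(f"turn {i}: unexpected role {role!r}")
        content = turn.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"turn {i}: empty or non-string content")
    return problems

def validate_jsonl(text: str) -> dict[int, list[str]]:
    """Map 1-based line numbers to their problems; valid lines are omitted."""
    report = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        problems = validate_jsonl_line(line)
        if problems:
            report[lineno] = problems
    return report
```

An empty report means every non-blank line passed these basic checks; it says nothing about the quality score or PII findings that Dataset Hub computes after upload.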

External Sources

  1. Click Connect Source in Dataset Hub.
  2. Choose from:
    • HuggingFace Hub – Connect datasets from HuggingFace
    • Cloud Storage – Connect from S3, GCS, or Azure Blob
    • External URLs – Connect to any HTTPS endpoint
  3. Configure connection settings and authentication.
  4. LLMTune syncs and validates the data.

Dataset Formats

For most training methods, use JSONL with conversation-style data:
{"messages": [{"role": "user", "content": "What is AI?"}, {"role": "assistant", "content": "AI is..."}]}
{"messages": [{"role": "user", "content": "Explain ML"}, {"role": "assistant", "content": "ML is..."}]}
Alternatively, use a conversations array:
{"conversations": [{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hi!"}]}
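If your source data is a CSV file, you can also convert it to the messages format yourself instead of mapping columns in the upload dialog. The following is a minimal sketch under stated assumptions: the function name and the column names (prompt, response) are hypothetical, and each row is treated as a single user/assistant exchange.

```python
import csv
import io
import json

def csv_to_jsonl(csv_text: str, user_col: str, assistant_col: str) -> str:
    """Convert CSV rows into JSONL records with a two-turn messages array."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        record = {
            "messages": [
                {"role": "user", "content": row[user_col]},
                {"role": "assistant", "content": row[assistant_col]},
            ]
        }
        # ensure_ascii=False keeps non-ASCII text readable in the output
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)

# Hypothetical column names, for illustration only.
sample = "prompt,response\nWhat is AI?,AI is...\n"
print(csv_to_jsonl(sample, user_col="prompt", assistant_col="response"))
```

A missing column name raises a KeyError here, which is usually preferable to silently writing empty turns.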

Specialized Formats

  • Multimodal – Include image URLs or base64-encoded images
  • Audio methods – Include audio file paths with transcripts
  • Code generation – Code examples with natural language prompts

Quality Scoring

Dataset Hub automatically scores datasets on:
  • Data quality metrics – Completeness, consistency, format validity
  • Token distribution – Analysis of token counts and distributions
  • Label balance – Class distribution for classification tasks
  • Coverage – Conversation depth and variety
Review quality reports in Dataset Hub to identify issues before training.
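To get a rough feel for completeness and token distribution before uploading, a local summary along these lines can help. This is a sketch, not the scoring Dataset Hub applies: the function name is an assumption, whitespace-separated words stand in for real tokenizer counts, and records are assumed to already be well-formed JSON with string content.

```python
import json
from statistics import mean, median

def summarize_jsonl(jsonl_text: str) -> dict:
    """Summarize record count, incomplete records, and approximate length."""
    lengths = []
    incomplete = 0
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        turns = record.get("messages") or record.get("conversations") or []
        contents = [turn.get("content", "") for turn in turns]
        # A record counts as incomplete if it has no turns or any empty turn.
        if not turns or any(not c.strip() for c in contents):
            incomplete += 1
        # Whitespace word count is only a proxy for tokenizer token counts.
        lengths.append(sum(len(c.split()) for c in contents))
    return {
        "records": len(lengths),
        "incomplete": incomplete,
        "mean_tokens": mean(lengths) if lengths else 0,
        "median_tokens": median(lengths) if lengths else 0,
        "max_tokens": max(lengths, default=0),
    }
```

A large gap between the median and the maximum often points to a few very long records that may be truncated during training.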

PII Detection and Masking

Dataset Hub automatically:
  • Detects PII – Identifies emails, phone numbers, credit cards, SSNs, etc.
  • Masks sensitive data – Replaces PII with placeholders before training
  • Generates reports – Shows what was detected and masked
This supports compliance with privacy regulations and reduces the risk of sensitive information reaching the training data.
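The detect-and-mask behavior described above can be sketched with regular expressions. This is an illustration only and not LLMTune's detector: the patterns, the placeholder names ([EMAIL], [SSN], [PHONE]), and the function name are assumptions, and the phone and SSN patterns cover only common US formats.

```python
import re

# Illustrative patterns; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def mask_pii(text: str) -> tuple[str, dict[str, int]]:
    """Replace matches with placeholders and count what was masked."""
    counts = {}
    for label, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"[{label}]", text)
        if n:
            counts[label] = n
    return text, counts

masked, report = mask_pii("Mail jane.doe@example.com or call 555-123-4567.")
print(masked)
print(report)
```

Returning the counts alongside the masked text mirrors the idea of a detection report: you can see what was replaced without keeping the original values.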

Versioning and Tags

  • Automatic versioning – Each upload becomes a new version
  • Rollback support – Revert to previous versions if needed
  • Tags – Add tags (e.g., priority:high, channel:support) to filter subsets
  • Metadata – Store notes, descriptions, and custom metadata
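Filtering by key:value tags amounts to a subset check. The sketch below assumes a hypothetical in-memory layout in which each record carries a "tags" list; how LLMTune stores tags internally is not specified here.

```python
def filter_by_tags(records: list[dict], required: list[str]) -> list[dict]:
    """Keep records whose tags include every required tag."""
    wanted = set(required)
    return [r for r in records if wanted <= set(r.get("tags", []))]

# Hypothetical records, for illustration only.
records = [
    {"id": 1, "tags": ["priority:high", "channel:support"]},
    {"id": 2, "tags": ["channel:support"]},
    {"id": 3, "tags": []},
]
print(filter_by_tags(records, ["priority:high", "channel:support"]))
```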

Blending Sources

During fine-tuning in FineTune Studio, you can blend multiple datasets:
  • Assign weights (e.g., support_chat:0.7, policy_docs:0.3)
  • Preview the combined distribution
  • Mix datasets from different sources (uploaded + HuggingFace + cloud storage)
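Weighted blending can be pictured as weighted random sampling across datasets. The sketch below is one possible interpretation, not FineTune Studio's implementation: the function name is an assumption, it samples with replacement, and it assumes every dataset is non-empty and at least one weight is positive.

```python
import random

def blend(datasets: dict[str, list], weights: dict[str, float],
          n: int, seed: int = 0) -> list[tuple[str, object]]:
    """Draw n examples, picking the source dataset according to its weight."""
    rng = random.Random(seed)  # fixed seed makes the blend reproducible
    names = list(datasets)
    picks = rng.choices(names, weights=[weights[name] for name in names], k=n)
    # Sample with replacement from whichever dataset was picked.
    return [(name, rng.choice(datasets[name])) for name in picks]

# Hypothetical dataset contents, for illustration only.
datasets = {
    "support_chat": ["chat example 1", "chat example 2"],
    "policy_docs": ["policy example 1"],
}
mixed = blend(datasets, {"support_chat": 0.7, "policy_docs": 0.3}, n=10)
print(mixed)
```

Counting the source names in the result gives a quick preview of the combined distribution; with a large n, the shares approach the assigned weights.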

Playground Datasets

FineTune Studio includes pre-configured playground datasets for each training method:
  • Automatically selected based on your chosen training method
  • Pre-validated and ready to use
  • Suited to quick testing and experimentation
  • Can be used as-is or combined with your own datasets

Best Practices

  1. Start with playground datasets – Use them to test training methods quickly
  2. Validate before training – Review quality scores and PII detection reports
  3. Use versioning – Keep track of dataset changes and roll back if needed
  4. Tag datasets – Use tags to organize and filter datasets
  5. Check format compatibility – Ensure your dataset format matches your training method

Next Steps