Datasets

Datasets are the foundation of every fine-tune. LLMTune supports conversational, document, and classification data through JSONL, CSV, or text uploads.

Upload Workflow

  1. Drag and drop files, or pick them from cloud storage (S3, GCS) if connected.
  2. Choose a dataset name and optional description.
  3. LLMTune runs schema detection to identify roles, prompts, responses, and metadata.
  4. Review profiling output:
    • Sample rows
    • Token estimates
    • Schema anomalies
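
Before uploading, it can help to run a similar check locally. The sketch below is not LLMTune's schema detection; it assumes a hypothetical JSONL layout in which each line holds a "messages" list of role/content pairs, and it uses a whitespace word count as a crude stand-in for a real token estimate. The function name and field names are illustrative only.

```python
import json
from collections import Counter

# Assumed record layout (not confirmed by LLMTune): one JSON object per line
# with a "messages" list of {"role": ..., "content": ...} items.
ALLOWED_ROLES = {"system", "user", "assistant"}


def profile_jsonl(lines, sample_size=2):
    """Return sample rows, a rough token estimate, role counts, and anomalies."""
    samples, anomalies, roles = [], [], Counter()
    token_estimate = 0
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            anomalies.append((number, "invalid JSON"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            anomalies.append((number, "missing or empty 'messages'"))
            continue
        if len(samples) < sample_size:
            samples.append(record)
        for message in messages:
            role = message.get("role") if isinstance(message, dict) else None
            content = message.get("content") if isinstance(message, dict) else None
            if role not in ALLOWED_ROLES:
                anomalies.append((number, f"unexpected role: {role!r}"))
                continue
            if not isinstance(content, str):
                anomalies.append((number, "non-string content"))
                continue
            roles[role] += 1
            # Whitespace word count is only a rough proxy for tokenizer output.
            token_estimate += len(content.split())
    return {
        "samples": samples,
        "token_estimate": token_estimate,
        "roles": dict(roles),
        "anomalies": anomalies,
    }
```

Calling profile_jsonl(open("support.jsonl", encoding="utf-8")) on a local file (the file name is an example) returns a dictionary you can inspect before comparing it with the profiling output shown after upload.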

Versioning and Tags

  • Each upload becomes a version. You can roll back to an earlier version or compare changes between versions.
  • Add tags (e.g., priority:high, channel:support) to filter subsets later.
  • Use the dataset editor to annotate, redact, or merge records.
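
To illustrate how key:value tags can select a subset, here is a minimal sketch over records held in memory. It assumes each record carries a "tags" list of strings such as "priority:high"; that layout and the helper names parse_tags and filter_by_tags are hypothetical and are not part of LLMTune.

```python
def parse_tags(tags):
    """Turn ["priority:high", "channel:support"] into a dict of key -> value."""
    parsed = {}
    for tag in tags:
        key, _, value = tag.partition(":")
        parsed[key.strip()] = value.strip()
    return parsed


def filter_by_tags(records, **required):
    """Keep records whose tags match every required key/value pair."""
    kept = []
    for record in records:
        tags = parse_tags(record.get("tags", []))
        if all(tags.get(key) == value for key, value in required.items()):
            kept.append(record)
    return kept
```

For example, filter_by_tags(records, priority="high", channel="support") keeps only records that carry both tags.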

Blending Sources

During fine-tuning, you can blend multiple datasets by assigning each one a weight. For example, mix customer support conversations with policy documents to enforce tone.
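
One way to picture weighted blending is as weighted sampling: each training example is drawn from a source chosen in proportion to its weight. The sketch below shows that idea only. How LLMTune applies weights internally is not described here, and the function name blend, its parameters, and the dataset names are assumptions for illustration.

```python
import random


def blend(datasets, weights, total, seed=0):
    """Draw `total` records, picking each source in proportion to its weight.

    `datasets` maps a name to a list of records; `weights` maps the same
    names to non-negative numbers. Sampling is with replacement.
    """
    if set(datasets) != set(weights):
        raise ValueError("datasets and weights must use the same names")
    if any(w < 0 for w in weights.values()) or sum(weights.values()) <= 0:
        raise ValueError("weights must be non-negative and not all zero")
    for name, records in datasets.items():
        if weights[name] > 0 and not records:
            raise ValueError(f"dataset {name!r} is empty but has a positive weight")
    rng = random.Random(seed)  # fixed seed keeps the blend reproducible
    names = sorted(datasets)
    picks = rng.choices(names, weights=[weights[n] for n in names], k=total)
    return [(name, rng.choice(datasets[name])) for name in picks]
```

With weights such as {"support": 0.7, "policy": 0.3}, roughly 70% of the drawn records come from the support conversations; the exact split varies from sample to sample.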

Quality Controls

  • Flag samples: Mark problematic records for follow-up.
  • Mask fields: Replace sensitive data (emails, account numbers) with placeholders before training.
  • Evaluate coverage: Use the analytics to check label distribution and conversation depth.
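
The masking and coverage checks above can also be approximated locally before upload. The following is a rough sketch, not LLMTune's implementation: the regular expressions are deliberately simple, the 8 to 12 digit account-number shape and the "label" field name are assumptions, and the placeholders [EMAIL] and [ACCOUNT] are arbitrary choices.

```python
import re
from collections import Counter

# Simplified patterns; real data usually needs stricter, domain-specific rules.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
ACCOUNT = re.compile(r"\b\d{8,12}\b")  # assumed shape of an account number


def mask(text):
    """Replace email addresses and account-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return ACCOUNT.sub("[ACCOUNT]", text)


def label_distribution(records, field="label"):
    """Return the share of records per label; missing labels are grouped."""
    counts = Counter(record.get(field, "<missing>") for record in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}
```

A heavily skewed distribution, or a large "<missing>" share, is a signal to collect or relabel data before training.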