Dataset format

Training jobs require data in a format the platform accepts. The following is a typical pattern; exact requirements may depend on the base model and training method.

JSONL (common)

Each line is a JSON object. A typical structure for instruction-following or chat:
{"prompt": "User question or instruction", "completion": "Expected answer"}
{"prompt": "...", "completion": "..."}
For chat-style data, a messages array can be used instead:
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages": [...]}
  • Use UTF-8 encoding.
  • One record per line; lines are not separated by commas and the file is not wrapped in a JSON array.
  • Field names (e.g. prompt/completion vs messages) must match what the platform expects for the chosen training method.
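As a plain illustration of these rules, here is a minimal Python sketch that writes prompt/completion records as JSONL, one JSON object per line in UTF-8. The file name and example content are placeholders, and the field names should match whatever the chosen training method expects; the same one-object-per-line rule applies to messages-style records.

import json

# Placeholder records -- keep the same field names and structure across all records.
records = [
    {"prompt": "What is the capital of France?", "completion": "Paris."},
    {"prompt": "Summarise: JSONL is one JSON object per line.", "completion": "Each line is a standalone JSON object."},
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        # json.dumps emits a single line per record; ensure_ascii=False keeps UTF-8 text readable.
        f.write(json.dumps(record, ensure_ascii=False) + "\n")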

Other formats

The platform may support additional formats (e.g. CSV or a specific schema). Check the dashboard and the training start API for the current list. When uploading or referencing a dataset, use a supported format and the correct content type.
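The exact upload mechanism is platform-specific, so the following is only a rough sketch of setting an explicit content type when uploading a JSONL file over HTTPS. The URL, header names, content type value, and response fields are placeholders, not the platform's actual API; take the real values from the dashboard or API documentation.

import requests

# Hypothetical endpoint and credentials -- replace with the values from the dashboard/API docs.
UPLOAD_URL = "https://api.example.com/v1/datasets"
API_KEY = "YOUR_API_KEY"

with open("train.jsonl", "rb") as f:
    response = requests.post(
        UPLOAD_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        # Explicit content type for the file part; confirm the value the platform expects.
        files={"file": ("train.jsonl", f, "application/jsonl")},
    )

response.raise_for_status()
print(response.json())  # e.g. a dataset ID to reference when starting a job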

Quality and size

  • Consistency — Keep field names and structure consistent across records.
  • Size — Very large datasets may hit size limits or take longer to process; the API or dashboard will indicate any constraints.
  • Validation — Invalid lines or malformed JSON can cause the job to fail or skip rows; validate before uploading (a minimal check is sketched after this list).
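A quick line-by-line check can catch malformed JSON and missing fields before upload. This is a minimal sketch, assuming the prompt/completion structure shown earlier; adjust required_keys for messages-style data.

import json

def validate_jsonl(path, required_keys=("prompt", "completion")):
    """Report lines that are not valid JSON or are missing the expected fields."""
    errors = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append((lineno, f"invalid JSON: {exc}"))
                continue
            missing = [k for k in required_keys if k not in record]
            if missing:
                errors.append((lineno, f"missing fields: {missing}"))
    return errors

for lineno, problem in validate_jsonl("train.jsonl"):
    print(f"line {lineno}: {problem}")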

Referencing the dataset

When starting a job you typically pass:
  • An uploaded dataset ID (from the dashboard or datasets API), or
  • A URL to a dataset file the platform can fetch (e.g. HTTPS).
Exact parameters (e.g. datasetId, datasetUrl) are defined in the training start endpoint documentation.
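As a purely illustrative sketch, a job start request referencing an uploaded dataset might look like the following. The endpoint path, model name, and the datasetId / datasetUrl parameter names echo the bullets above but are assumptions; check them against the training start endpoint documentation before use.

import requests

# Hypothetical training-start endpoint -- substitute the real path and parameters from the docs.
START_URL = "https://api.example.com/v1/training/jobs"
API_KEY = "YOUR_API_KEY"

payload = {
    "baseModel": "example-base-model",  # placeholder model name
    "datasetId": "ds_123",              # an uploaded dataset ID...
    # "datasetUrl": "https://example.com/train.jsonl",  # ...or an HTTPS URL the platform can fetch
}

response = requests.post(
    START_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
)
response.raise_for_status()
print(response.json())  # job ID and status, depending on the API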