Dataset format

Training jobs require data in a format the platform accepts. The following is a typical pattern; exact requirements may depend on the base model and training method.

JSONL (common)

Each line is a JSON object. A typical structure for instruction-following or chat:
{"prompt": "User question or instruction", "completion": "Expected answer"}
{"prompt": "...", "completion": "..."}
For chat-style data, a messages array can be used instead:
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages": [...]}
  • Use UTF-8 encoding.
  • One record per line; lines are not separated by commas and the file is not wrapped in a JSON array.
  • Field names (e.g. prompt/completion vs messages) must match what the platform expects for the chosen training method.
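As a plain illustration of these rules, here is a minimal Python sketch that writes prompt/completion records as JSONL, one JSON object per line in UTF-8. The file name and example content are placeholders, and the field names should match whatever the chosen training method expects; the same one-object-per-line rule applies to messages-style records.

import json

# Placeholder records -- keep the same field names and structure across all records.
records = [
    {"prompt": "What is the capital of France?", "completion": "Paris."},
    {"prompt": "Summarise: JSONL is one JSON object per line.", "completion": "Each line is a standalone JSON object."},
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        # json.dumps emits a single line per record; ensure_ascii=False keeps UTF-8 text readable.
        f.write(json.dumps(record, ensure_ascii=False) + "\n")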

Other formats

The platform may support additional formats (e.g. CSV or a specific schema). Check the dashboard and the training start API for the current list. When uploading or referencing a dataset, use a supported format and the correct content type.
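The exact upload mechanism is platform-specific, so the following is only a rough sketch of setting an explicit content type when uploading a JSONL file over HTTPS. The URL, header names, content type value, and response fields are placeholders, not the platform's actual API; take the real values from the dashboard or API documentation.

import requests

# Hypothetical endpoint and credentials -- replace with the values from the dashboard/API docs.
UPLOAD_URL = "https://api.example.com/v1/datasets"
API_KEY = "YOUR_API_KEY"

with open("train.jsonl", "rb") as f:
    response = requests.post(
        UPLOAD_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        # Explicit content type for the file part; confirm the value the platform expects.
        files={"file": ("train.jsonl", f, "application/jsonl")},
    )

response.raise_for_status()
print(response.json())  # e.g. a dataset ID to reference when starting a job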

Quality and size

  • Consistency — Keep field names and structure consistent across records.
  • Size — Very large datasets may hit size limits or take longer to process; the API or dashboard will indicate any constraints.
  • Validation — Invalid lines or malformed JSON can cause the job to fail or skip rows; validate before uploading (a minimal check is sketched after this list).
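A quick line-by-line check can catch malformed JSON and missing fields before upload. This is a minimal sketch, assuming the prompt/completion structure shown earlier; adjust required_keys for messages-style data.

import json

def validate_jsonl(path, required_keys=("prompt", "completion")):
    """Report lines that are not valid JSON or are missing the expected fields."""
    errors = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append((lineno, f"invalid JSON: {exc}"))
                continue
            missing = [k for k in required_keys if k not in record]
            if missing:
                errors.append((lineno, f"missing fields: {missing}"))
    return errors

for lineno, problem in validate_jsonl("train.jsonl"):
    print(f"line {lineno}: {problem}")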

Referencing the dataset

When starting a job you typically pass:
  • An uploaded dataset ID (from the dashboard or datasets API), or
  • A URL to a dataset file the platform can fetch (e.g. HTTPS).
Exact parameters (e.g. datasetId, datasetUrl) are defined in the training start endpoint documentation.
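As a purely illustrative sketch, a job start request referencing an uploaded dataset might look like the following. The endpoint path, model name, and the datasetId / datasetUrl parameter names echo the bullets above but are assumptions; check them against the training start endpoint documentation before use.

import requests

# Hypothetical training-start endpoint -- substitute the real path and parameters from the docs.
START_URL = "https://api.example.com/v1/training/jobs"
API_KEY = "YOUR_API_KEY"

payload = {
    "baseModel": "example-base-model",  # placeholder model name
    "datasetId": "ds_123",              # an uploaded dataset ID...
    # "datasetUrl": "https://example.com/train.jsonl",  # ...or an HTTPS URL the platform can fetch
}

response = requests.post(
    START_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
)
response.raise_for_status()
print(response.json())  # job ID and status, depending on the API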