Dataset format
Training jobs require data in a format the platform accepts. The following is a typical pattern; exact requirements may depend on the base model and training method.JSONL (common)
Each line is a JSON object. A typical structure for instruction-following or chat:- Use UTF-8 encoding.
- One record per line; no trailing commas between lines.
- Field names (e.g.
prompt/completionvsmessages) must match what the platform expects for the chosen training method.
Other formats
The platform may support additional formats (e.g. CSV or a specific schema). Check the dashboard and the training start API for the current list. When uploading or referencing a dataset, use a supported format and the correct content type.Quality and size
- Consistency — Keep field names and structure consistent across records.
- Size — Very large datasets may have limits or longer processing; the API or dashboard will indicate any constraints.
- Validation — Invalid lines or malformed JSON can cause the job to fail or skip rows; validate before uploading.
Referencing the dataset
When starting a job you typically pass:- An uploaded dataset ID (from the dashboard or datasets API), or
- A URL to a dataset file the platform can fetch (e.g. HTTPS).
datasetId, datasetUrl) are defined in the training start endpoint documentation.