Training Datasets

To train a custom model, you will need to collect a dataset of training examples. Each example consists of a prompt, which is a valid input text, and a completion, which is a correct output for that specific prompt.

πŸ”’ How many examples do I need?

At least 50 training examples are required to train a custom model. If you upload a file that contains fewer than 50 examples, you will encounter an error. More is better - so provide as many examples as you can.

🧩 It's all about quantity?

No! Data quality and diversity matter as well, so it’s a good idea to go over your examples manually and make sure they accurately capture various aspects of your task. It's critical that your examples reflect the real-world distribution of your problem.

File Formats

The dataset must be submitted in a single csv or jsonl file. We recommend using jsonl due to better handling of line breaks.

If you are submitting a csv file, it should have two columns labeled "prompt" and "completion". Each row should contain a single training example, with no empty rows in between.

If you are submitting a jsonl file, each line in the file should be a json dictionary with two fields, "prompt" and "completion".

Below is a simple example for both csv and jsonl formats:

"Where was the steam engine invented?
"In what year did the US gain its independence?
"Who was the longest reigning roman emperor?
{"prompt": "Where was the steam engine invented?\n", "completion": "England"}
{"prompt": "In what year did the US gain its independence?\n", "completion": "1776"}
{"prompt": "Who was the longest reigning roman emperor?\n", "completion": "Augustus"}

πŸ›‘ Common Errors


Uploading a file with fewer than 2 columns or with inconsistent columns in different lines will result in an error.

You may upload files with different column names, in which case you will be prompted to identify the desired prompt column and the desired completion column.


Completions cannot be empty

prompts can be empty.