Datasets: Best Practices

Best Practices for Prompt and Completion Content

The simplest way to structure the training data is for the prompt of each training example to contain only the text input for that example. Note that this differs from few-shot prompt engineering for a general-purpose model, where the prompt contains a sequence of example inputs and outputs.

In any case, the combined length of the prompt and completion must not exceed 2047 tokens or approximately 6000 characters of English text. If you include longer examples, you will encounter an error after uploading the file.
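As a rough pre-upload check, the length limit can be approximated with a character count. The sketch below is illustrative only: the function name `find_long_examples` is hypothetical, and it uses the approximate 6000-character figure as a stand-in for the 2047-token limit, since actual token counts depend on the tokenizer.

```python
import json

# Assumption: ~6000 characters of English text is used as a rough proxy
# for the 2047-token limit; the real limit is measured in tokens.
MAX_CHARS = 6000

def find_long_examples(jsonl_text, max_chars=MAX_CHARS):
    """Return 1-based line numbers whose prompt + completion exceed max_chars."""
    too_long = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if len(record["prompt"]) + len(record["completion"]) > max_chars:
            too_long.append(line_no)
    return too_long
```

Because this is only a character-based estimate, examples close to the limit may still be rejected or accepted differently once they are tokenized.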

For best results, we recommend following the guidelines below for the prompt and completion. All snippets assume the .jsonl format, but the same guidelines apply to .csv files.

1. End prompts with a newline or start completions with a whitespace character

Either of these helps avoid complications due to tokenization. Don’t mix and match; pick one and stick with it for all examples in the training set. See examples below:

{"prompt": "Where was the steam engine invented?\n", "completion": "England"}
{"prompt": "Where was the steam engine invented?", "completion": " England"}

2. Add a title or a separator

You should clearly mark the boundary between the prompt and the completion. A simple way to do this is to insert a separator at the end of each prompt. You can use any simple, distinct sequence, such as ## on a new line.

It is often a good idea to take this a step further by using titles. You can prepend a title that describes the input to the beginning of the prompt, and add another title that describes the output to the end of the prompt (in place of the separator). The titles can be short, but it usually helps to pick straightforward titles that are not generic and that relate to the content of the input and output.

Again, it’s important to be consistent. Pick one style and stick with it. If you use titles, they should be the same in all training examples. In any case, keep in mind the first guideline and either end the prompt with a newline or start the completion with a whitespace character.

See a few examples below:

{"prompt": "I know so many languages\n##\n", "completion": "Ich kenne so viele Sprachen"}
{"prompt": "English: I know so many languages\nGerman:", "completion": " Ich kenne so viele Sprachen"}
{"prompt": "English:\nI know so many languages\nGerman:\n", "completion": "Ich kenne so viele Sprachen"}

3. Include natural language instructions

Adding a short instruction that explains the task to each prompt often improves results. As before, stay consistent and use the same instruction in all examples. The two lines below show the same example with and without an instruction:

{"prompt": "Translate the following text from English to German.\nEnglish: It's my birthday today\nGerman:", "completion": " Heute ist mein Geburtstag"}
{"prompt": "English: It's my birthday today\nGerman:", "completion": " Heute ist mein Geburtstag"}

4. Remove duplicates

Duplicate entries are detrimental to training, so we recommend removing them.
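Exact duplicates can be removed with a short script before uploading. The sketch below treats two entries as duplicates only when both prompt and completion match exactly; that definition, and the function name `remove_duplicates`, are assumptions made for illustration.

```python
import json

def remove_duplicates(jsonl_text):
    """Drop repeated (prompt, completion) pairs, keeping the first occurrence."""
    seen = set()
    unique_lines = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        key = (record["prompt"], record["completion"])
        if key not in seen:
            seen.add(key)
            unique_lines.append(line)
    return "\n".join(unique_lines)
```

Near-duplicates (for example, entries that differ only in capitalization) are not detected by this approach and would need a looser comparison key.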

Don't have enough examples?

Do not duplicate existing examples just to reach 50 examples. If you need more, you can use AI21 Studio to generate additional examples based on the ones you already have.

Handling Complex Inputs

You can also train a custom model to perform tasks that have more than one input. Simply combine all the inputs into the prompt, with an appropriate title for each input. For example, consider a model trained to generate short ads for cars from a list of technical specifications. The different fields can be organized in a template as in the following example:

Model name: Viesta
Fuel consumption: 60 MPG
Horsepower: 100 HP
Price: $11,000
##

An appropriate dataset for training such a model would look like the following:

{"prompt": "Model name: Viesta\nFuel consumption: 60 MPG\nHorsepower: 100 HP\nPrice: $11,000\n##\n", "completion": "The new Viesta model runs efficiently with an amazing 60 MPG fuel consumption, available now for $11,000."}
{"prompt": "Model name: Lambada\nFuel consumption: 15 MPG\nHorsepower: 500 HP\nPrice:$120,000\n##\n", "completion": "Feel the power with the new Lambada, burning the road with a mind-blowing 500-Horsepower engine."}