Tips: Building a Dataset

  • Keep the prompt format identical in the dataset and when querying the trained model. For example, use the same instruction for every input and make sure the whitespace matches exactly.
  • Finish each prompt with a newline, or start every completion with a whitespace character. Whichever you choose, apply it consistently across the entire dataset.
  • Make sure the examples in your dataset reflect the type of tasks the model will face. Examples:
    • If you are generating copy for a wide variety of industries, you should have examples from all these industries in your dataset.
    • If you’re building a model that generates movie taglines and only train it on comedies, it won’t do a great job of generating horror movie taglines.
  • Clearly define what you expect from the output, then create examples that demonstrate these expectations. If you expect information from the input to appear in the output, provide only examples in which it does. Every completion in your dataset should be something you would be happy to get from the model.
  • Your model is only as good as your data. Make sure you are providing your best examples.
  • Although 50 examples is the minimum, try to provide more if possible. If you end up with only 50 examples, make sure they cover a wide variety of cases to avoid overfitting the model to certain types of examples.
  • Make sure your dataset is clean: it should contain no incorrect, empty, or duplicated examples.
  • Remove duplicates to prevent data leakage. Otherwise, the same example can appear in both the training set and the test set, and the test loss will not reflect the “true” loss.
  • As with few-shot prompts, a clear instruction may help. Even the phrasing of your prompt matters, and it’s not an exact science. Try different variations and see what works best.
  • Unlike with few-shot prompts, here you don’t need to end your completions with a distinctive token (e.g., ##). The model learns when to stop generating, so it will simply provide a completion and nothing else.
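
Several of the tips above (consistent prompt format, no empty or duplicated examples, at least 50 examples, no overlap between train and test) can be checked mechanically before training. The following is a minimal sketch, assuming each example is a dict with hypothetical "prompt" and "completion" keys. The helper names, the instruction string, and the sample data are made up for illustration and are not part of any specific fine-tuning API.

```python
import random

MIN_EXAMPLES = 50  # minimum dataset size mentioned in the tips above


def clean_examples(examples):
    """Drop empty and exactly duplicated examples, preserving order."""
    seen = set()
    cleaned = []
    for ex in examples:
        prompt = ex.get("prompt", "")
        completion = ex.get("completion", "")
        if not prompt.strip() or not completion.strip():
            continue  # empty prompt or completion
        key = (prompt, completion)
        if key in seen:
            continue  # exact duplicate of an earlier example
        seen.add(key)
        cleaned.append({"prompt": prompt, "completion": completion})
    return cleaned


def format_problems(examples, instruction):
    """Return indices of examples whose prompt breaks the shared format.

    Assumed convention: every prompt starts with the same instruction
    and ends with a newline.
    """
    bad = []
    for i, ex in enumerate(examples):
        prompt = ex["prompt"]
        if not prompt.startswith(instruction) or not prompt.endswith("\n"):
            bad.append(i)
    return bad


def split_examples(examples, test_fraction=0.1, seed=0):
    """Shuffle and split into (train, test).

    Call this after clean_examples so that no identical example
    lands in both sets.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Illustrative data only (hypothetical examples).
raw = [
    {"prompt": "Write a tagline:\nA haunted lighthouse\n",
     "completion": "The light never goes out."},
    {"prompt": "Write a tagline:\nA haunted lighthouse\n",
     "completion": "The light never goes out."},  # duplicate
    {"prompt": "Write a tagline:\nTwo rival bakers\n",
     "completion": ""},  # empty completion
    {"prompt": "Write a tagline: A road trip",
     "completion": "Miles of trouble."},  # breaks the prompt format
]

data = clean_examples(raw)
print(len(data))                                    # 2
print(format_problems(data, "Write a tagline:\n"))  # [1]
print(len(data) >= MIN_EXAMPLES)                    # False
```

Exact-match deduplication is the simplest choice here. Near-duplicates (for example, examples that differ only in whitespace) would need a normalization step before comparison, which this sketch deliberately leaves out.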