Generating More Data Using AI21 Studio

➿ Synthesizing training data

If you don’t have enough pre-existing training data, you can use prompt engineering on top of J2 models to generate more data efficiently. Assuming you’ve already created a prompt with just a few examples that works well for your task, you can leverage it in one of two ways (both are sketched in the code after this list):

  • If you have examples of inputs that represent your use case, feed them to J2 with your prompt, and collect the outputs generated by the model. The pairs of inputs and generated outputs will be your training set. Note that you can often collect relevant input examples relatively easily, either from public sources on the web or from your own data.

  • If you don’t have access to relevant input examples, you can let J2 generate both the inputs and the outputs. Feed J2 with a sequence of examples (input #1, output #1, input #2, output #2, and so on), and let it generate more examples. This tends to work well for short inputs, up to a few sentences long, but may result in a higher rate of bad examples for longer inputs (e.g., whole articles).
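
For concreteness, here is a minimal sketch of both ways against the J2 completion REST endpoint. The endpoint URL and response shape are assumptions based on the Studio v1 API, and the prompt template, sampling parameters, and file name are illustrative only; substitute the prompt that already works well for your task.

```python
import json

import requests

API_KEY = "YOUR_AI21_API_KEY"  # placeholder: your AI21 Studio API key
# Assumed v1 completion endpoint; swap "j2-ultra" for another J2 model if needed.
ENDPOINT = "https://api.ai21.com/studio/v1/j2-ultra/complete"

def complete(prompt, stop, max_tokens=120):
    """Send a completion request and return the generated text."""
    response = requests.post(
        ENDPOINT,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "prompt": prompt,
            "maxTokens": max_tokens,
            "temperature": 0.7,
            "stopSequences": stop,
        },
    )
    response.raise_for_status()
    return response.json()["completions"][0]["data"]["text"].strip()

# Illustrative few-shot prompt; replace with the prompt that already
# works well for your task.
FEW_SHOT = """Summarize the review in one sentence.

Review: The battery lasts two days and the screen is gorgeous.
Summary: Long battery life and an excellent screen.

Review: The camera is blurry and support never answered my emails.
Summary: Blurry camera and unresponsive customer support.
"""

# Way 1: you already have representative inputs; collect J2's outputs.
inputs = ["Shipping took three weeks, but the headphones sound amazing."]
pairs = [
    {"input": text,
     "output": complete(FEW_SHOT + f"\nReview: {text}\nSummary:", stop=["\n"])}
    for text in inputs
]

# Way 2: no inputs available, so let J2 continue the example pattern
# and generate a fresh "Review: ... / Summary: ..." pair.
continuation = complete(FEW_SHOT + "\nReview:", stop=["\nReview:"])
review, _, summary = continuation.partition("Summary:")
pairs.append({"input": review.strip(), "output": summary.strip()})

# Save as JSONL for review and, later, custom-model training.
with open("synthetic_train.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")
```

Note that the same `complete` helper serves both ways; only the prompt suffix and the stop sequence change. In way 1 the stop sequence ends generation after the output line, while in way 2 it cuts the model off just before it starts inventing yet another example.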

The process above is illustrated step-by-step in our blog post. We recommend using J2-Ultra or J2-Mid rather than J2-Light for generating synthetic training data, since the larger models typically produce higher-quality results with prompt engineering.

We recommend reviewing and validating the generated content (or at least a sample of it) before training a custom model, to ensure the data properly captures the desired behavior. Watch out for incorrect, corrupted, or toxic generations: including these in the training data will degrade the resulting custom model. If you find such bad examples, amend them manually or simply exclude them from the training data.
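
As a lightweight first pass before (or alongside) manual review, you can automatically flag the most obvious failures. The sketch below assumes the JSONL format from the earlier example; the length threshold, duplicate check, and the `is_toxic` hook are placeholders to be replaced with checks suited to your task.

```python
import json

def is_toxic(text: str) -> bool:
    """Placeholder: plug in your preferred moderation/toxicity check."""
    return False

seen_inputs = set()
kept, flagged = [], []

with open("synthetic_train.jsonl") as f:
    for line in f:
        pair = json.loads(line)
        text_in, text_out = pair["input"].strip(), pair["output"].strip()
        bad = (
            not text_in
            or not text_out                # empty/truncated generations
            or text_in in seen_inputs      # exact duplicate inputs
            or is_toxic(text_in)
            or is_toxic(text_out)
        )
        (flagged if bad else kept).append(pair)
        seen_inputs.add(text_in)

# Keep the clean pairs; route the flagged ones to manual review.
with open("synthetic_train.clean.jsonl", "w") as f:
    for pair in kept:
        f.write(json.dumps(pair) + "\n")

print(f"kept {len(kept)} pairs, flagged {len(flagged)} for manual review")
```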