Overview
Prompt engineering is the practice of crafting the right prompt to generate the output that you want. With proper prompting, an LLM can perform an impressive range of tasks: generating an email or product description, summarizing provided text, classifying text into standard or custom categories, responding to a customer query with an appropriate answer, and much more. Every model behaves differently given the same prompt, so you'll probably spend a fair amount of time adjusting your prompt for your specific use case or for different models. The best practices given here are not absolute rules, but guidelines.
Components of a prompt
A prompt, in our context, consists of the following information. Other than the instruction, all elements are optional; include them as needed for the task to improve or customize the response.
- Instruction: What the model should do, often in great detail.
- The format or syntax of the response (markdown/text, HTML, JSON, XML).
- The persona of the LLM (“a friendly travel agent”).
- Any stylistic or other guidelines for the output ("simple, plain language, no more than two sentences long").
- Examples to follow, if appropriate.
- Any text to be analyzed or acted upon (such as a user's question, or financial or medical information).
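The components above can be assembled mechanically. The sketch below is illustrative only: the helper name, the task, and the exact ordering are assumptions, not part of any SDK, but it shows how each optional element slots in around the instruction.

```python
# Illustrative sketch: assemble a prompt from the components listed above.
# build_prompt and the sample task are hypothetical, not from any SDK.

def build_prompt(instruction, output_format=None, persona=None,
                 style=None, examples=None, input_text=None):
    """Concatenate the optional prompt components around the instruction."""
    parts = []
    if persona:
        parts.append(f"You are {persona}.")
    parts.append(instruction)
    if output_format:
        parts.append(f"Respond in {output_format}.")
    if style:
        parts.append(f"Style: {style}")
    if examples:
        parts.append("Examples:\n" + "\n\n".join(examples))
    if input_text:
        parts.append(f"Text:\n{input_text}")
    return "\n\n".join(parts)

prompt = build_prompt(
    instruction="Summarize the customer review below in one sentence.",
    output_format="plain text",
    persona="a friendly support agent",
    style="simple, plain language, no more than two sentences",
    input_text="The checkout page kept timing out, but support fixed it quickly.",
)
print(prompt)
```

Keeping the components in separate variables like this also makes it easy to vary one element (say, the persona) while holding the rest of the prompt constant during testing.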
Developing the prompt
Your first prompt is rarely good enough, particularly when designing a prompt to use for a commercial system. You'll spend a lot of time refining your prompt and assessing the results. Here is a typical workflow for designing and refining a prompt:
- Start with just the instruction, without examples. You can ask the LLM for help creating a starter prompt and modify it from there.
- Test your prompt against one example, and evaluate the quality.
- Modify the prompt, test, and repeat.
- While you’re adjusting your prompt, set the temperature to 0 to get consistent answers, then gradually increase the temperature in small increments up to where you are getting the amount of variation that you want.
- When the model produces bad output, bring the temperature back to zero and adjust the prompt again to try to understand what's producing the bad output.
- Collect a test set of 10 common inputs and ideal outputs, and test your prompt against those inputs.
- Be sure that your test prompts include both common inputs and inputs that test the extremes of valid requests.
- Include test examples of bad input to see how the model reacts to it.
- Grade the test results against the expected result (you can grade on a score from 1-10 or 1-100). Evaluate the results, adjust your prompt, and repeat.
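The grading step above can be automated once your test set exists. The sketch below is a toy version: the `grade` function here is a stand-in for a human judgment or an LLM-based evaluation, and the test cases and 1-10 scores are illustrative assumptions.

```python
# Toy sketch of scoring a prompt against a small test set.
# grade() stands in for a human or LLM judgment; data is illustrative.

test_set = [
    {"input": "Where is my order?", "expected": "order-status"},
    {"input": "The app crashes on login.", "expected": "bug-report"},
    {"input": "asdfgh!!", "expected": "unclear"},  # deliberately bad input
]

def grade(model_output, expected):
    """Toy grader: 10 for an exact match, 0 otherwise."""
    return 10 if model_output == expected else 0

# model_outputs would come from running your prompt against each input.
model_outputs = ["order-status", "bug-report", "order-status"]

scores = [grade(out, case["expected"])
          for out, case in zip(model_outputs, test_set)]
average = sum(scores) / len(scores)
failures = [case["input"] for case, s in zip(test_set, scores) if s < 5]
print(average, failures)
```

Tracking the failing inputs, not just the average score, tells you which kind of input to target in your next prompt revision.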
Prompt design tips
Try asking the model for help with your initial prompt
Ask the model to provide you with a starter prompt, and start working with that. Ask in clear, simple language, providing as much detail as you can. This can save you a bit of time and effort in the beginning.
Ensure that the prompt is clear
When crafting prompts for Jamba models, follow this fundamental principle: Write your prompts as you would want them to be written for you. If you can't understand the prompt well, neither will an LLM. To elaborate:
- Clarity and Comprehensibility: Is the prompt clear to you? Would it be understandable to someone approaching it without any prior knowledge? Ensure that your prompt is free from grammatical errors and that the instructions are expressed clearly and unambiguously.
- Formatting and Examples: Is the prompt well-organized and properly formatted? Have you provided examples where necessary to illustrate the expected outcome? More about examples later.
Describe what you want, not what you don’t want
Focus on telling the model what you want it to do. Minimize "do nots." You can use negatives occasionally, of course, but excessive use of "don't do" and "avoid x" is a sign that your prompt may be going in the wrong direction, and you may want to rewrite it. For example, rather than "Don't use jargon, don't write more than two sentences, and avoid passive voice," write "Use plain language and active voice, in at most two sentences."
Allow for "I don't know"
Tell the model to say explicitly when it doesn't have an answer or enough information to give an answer. Otherwise, the model might be compelled to try to give an answer even if that means supplying information that might not be true (LLMs are biased toward providing an answer).
Set a persona for the model
Describe the role that the LLM should assume when answering the question. This is frequently referred to as a system prompt, and has been found to produce better completions for many families of LLMs. It affects not only the tone and language used, but also the amount of detail and the level of expertise. For example, asking the model to answer as a 10-year-old might be good for an ad or a story, but might result in errors in an answer to a technical question. Conversely, asking the model to assume an extremely specialized persona (a postdoctoral researcher in particle physics) might limit the source information used to generate the answer. System prompts can also guide the perspective the model takes when answering the question; for example, thinking about the problem as a research assistant, a customer, or a novice. When accessing the model in code, the system prompt is specified by an initial message with the role "system". In the playground, provide the information in the System instructions section. Alternatively, you can put the system prompt directly into the prompt itself, although this might be less effective.
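In code, the system message looks like the sketch below. The message structure shown is the chat format common to most LLM SDKs, but the exact client call varies by provider, so only the message list is shown; the persona and question are illustrative.

```python
# Sketch of setting a persona via an initial "system" message.
# The message-list structure is the common chat format; the exact
# client call that consumes it varies by SDK. Persona is illustrative.

def with_system_prompt(system_prompt, user_message):
    """Prepend a system message that sets the model's persona."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]

messages = with_system_prompt(
    "You are a friendly travel agent. Answer in simple, plain language, "
    "no more than two sentences.",
    "What's the best time of year to visit Lisbon?",
)
```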
Put the data before the question
When asking a question about information given in the prompt (in-prompt contextual answer), first provide the data and then ask the question. For example, paste the source text first, then end the prompt with your question about that text.
Use the appropriate tool
For straightforward math calculations or other actions that can be done in simple code, use a more appropriate tool for the job (a calculator, a macro, a short code snippet). Those tools are designed specifically for the job, and provide much more controllable and consistent output than an LLM.
Consistency vs creativity
Output variability is influenced by two parameters: temperature and top P.
- Temperature: Flattens or sharpens the curve of probability values for all candidates in the pool. When the curve is completely flat (T=max), every candidate in the pool is equally likely to be chosen, no matter how unlikely the candidate originally was. When the curve is maximally sharpened (T=0), the most likely candidate is the only choice, and all the other candidates have probability zero. Temperature does not adjust the size of the pool by eliminating candidates, unless you choose T=0, which essentially removes all candidates from the pool except the most likely one. For most use cases, a temperature of around 0.7 is about right.
- Top P: Controls the size of the pool of candidates for the next token (remember P for Pool). The smaller the candidate pool, the less variability. Candidates are cut in order from least likely to most likely, so at its smallest setting (0.01), only the most likely candidates–possibly only one–will be in the pool. At 1.0, the pool is full sized and contains all possible candidates (including some possible weirdos). In practice you won't need to adjust top P often.
If the temperature is very high (greater than 0.7), reduce top P by a few tenths of a point to remove the very unlikely results, unless you want very high creativity, in which case you can keep top P high (try a top P of 0.99 to omit the extremes). With high temperatures, also set a max_tokens limit as a failsafe (see Limiting output length below).
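The interaction of the two parameters can be seen in a toy simulation. The sketch below is not how any particular model implements sampling internally; it is a minimal illustration, with made-up logits, of temperature reshaping the probability curve and top P trimming the candidate pool.

```python
import math

# Toy next-token distribution; the scores are made up for illustration.
logits = {"the": 2.0, "a": 1.0, "banana": -1.0, "quantum": -3.0}

def softmax_with_temperature(logits, temperature):
    """Low T sharpens the curve toward the top candidate; high T flattens it."""
    scaled = {tok: v / temperature for tok, v in logits.items()}
    m = max(scaled.values())
    exps = {tok: math.exp(v - m) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def top_p_pool(probs, top_p):
    """Keep the likeliest tokens whose cumulative probability reaches top_p."""
    pool, cumulative = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        pool.append(tok)
        cumulative += p
        if cumulative >= top_p:
            break
    return pool

sharp = softmax_with_temperature(logits, 0.2)  # near-deterministic
flat = softmax_with_temperature(logits, 2.0)   # much more uniform
print(top_p_pool(flat, 0.5))   # small pool: only the likeliest survive
print(top_p_pool(flat, 1.0))   # full pool: every candidate stays in
```

Note how at top P = 1.0 even "quantum" stays in the pool, which is why the advice above pairs high temperature with a slightly reduced top P.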
Limiting output length
If you have a length goal or limit for your output, specify it in the prompt as the desired or maximum number of lines, words, sentences, or paragraphs. Don't expect the model to hit the mark exactly: it generates one token at a time and cannot plan an exact length, and it might need a bit more or less than you specify to provide a good answer. Examples: "Write two or three sentences about…," or "Limit your answer to 10 words." The API supports a max_tokens parameter, but you should use it only as a failsafe to prevent the edge case of the model going off in a completely unexpected direction and far exceeding your length limits (higher temperature can increase output length). This value is absolutely respected, so if the output hits this limit, the result might stop in the middle of a word. Set this value a fair bit higher than any prompt-suggested limit.
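One way to pick a failsafe value is to derive it from the word limit stated in the prompt. The heuristic below is an assumption for illustration (roughly 1.5 tokens per word, with a 3x safety margin); tune the ratio for your tokenizer and language.

```python
# Illustrative heuristic: derive a generous max_tokens failsafe from a
# word limit. The tokens-per-word ratio and margin are assumptions.

def failsafe_max_tokens(word_limit, tokens_per_word=1.5, margin=3.0):
    """Return a max_tokens value well above the prompt-stated word limit."""
    return int(word_limit * tokens_per_word * margin)

prompt = "Limit your answer to 10 words. Summarize the review below."
params = {"max_tokens": failsafe_max_tokens(10)}  # far above the 10-word goal
```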
Requesting citations
LLMs have been known to invent citations, so asking Jamba Instruct for citations for its information is not a guarantee of accuracy. If you need absolutely reliable citations, use the RAG Engine Contextual Answer model.
Use labels, not numbers, to rate output
It is common for people to want a numerical scoring of "good" or "bad" ("on a scale of 1 to 10…"). Assigning exact numbers to subjective categories is hard for people, harder for LLMs, and also gives a false sense of accuracy. Simple labels like "Bad," "Okay," and "Best" are easier for the language model to provide than precise numerical ratings like 4.7 or 6.8. Note that although you can use numbers to represent categories, it is generally preferable to use category labels that are inherently meaningful, such as "None," "Some," and "Most." For example: "How much of the review discusses the product's price? Answer None, Some, or Most."
Ask the LLM if it understands the prompt
To speed up development, consider first asking the LLM if it understands the instructions and other key terms in the prompt. This can help ensure that the LLM understands the core idea of what you are trying to do.
Don't do math
LLMs are famously bad at math. They're getting better, but it's generally still not a good idea to ask an LLM to solve a word problem for you. LLMs can count and do sums, but word problems and logic are tricky and produce inconsistent results. For simple or straightforward math, use a calculator or another more appropriate tool. If you must evaluate a word problem or logic, you can increase the accuracy by asking the LLM to explain each step it takes in the process (called chain-of-thought prompting).
Providing examples
Providing examples of what you want to see (also called "few-shot prompts") can be very useful to Jamba. Examples can be especially helpful when:
- Providing clear instruction is laborious or difficult, or
- The model seems to not fully understand your instructions.
Some guidelines for examples:
- Separate examples clearly with a blank line, or another obvious marker (some people use ###).
- Put the instructions all at the beginning or at the end.
- Be sure all examples use the same structure; don’t show different examples with different fields, or totally different structure.
- When providing just a few examples, the ordering of examples probably doesn’t matter. When using the RAG engine, or providing very, very large examples or input, you might find that the order of files used as input might affect the output (a “recency bias”).
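The guidelines above can be sketched as a few-shot prompt builder: instructions first, examples in one consistent structure, each separated by ###. The sentiment-classification task and the exact field names are illustrative assumptions.

```python
# Sketch of a few-shot prompt: instructions at the start, consistently
# structured examples separated by ###. Task and labels are illustrative.

instruction = "Classify the sentiment of the review as Positive or Negative."

examples = [
    ("The battery lasts all day. Love it!", "Positive"),
    ("Stopped working after a week.", "Negative"),
]

# Every example uses the same Review/Sentiment structure.
blocks = [f"Review: {text}\nSentiment: {label}" for text, label in examples]

new_review = "Setup was painless and the screen is gorgeous."
prompt = (instruction + "\n\n###\n\n"
          + "\n\n###\n\n".join(blocks)
          + f"\n\n###\n\nReview: {new_review}\nSentiment:")
print(prompt)
```

Ending the prompt with the bare "Sentiment:" field invites the model to complete the pattern established by the examples.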
Separate the instructions from the examples (and the examples from each other)
Put the instructions at the start or end of your examples. Separate instructions and examples with some clear separator (even a blank line will do). Some people use ## or >> marks, but empty lines should work fine.
Cover all cases in your examples
When providing examples, do your best to cover all relevant example types. For example, if the answer to a question posed to the LLM can be "yes" or "no," provide examples where the answer is yes and examples where the answer is no. Ideally, the distribution should match real-life use cases as well; for instance, if users send both vague and specific feedback, include examples that show how to respond to each.
Be aware of bias with matching examples
If you provide examples, the model can be biased toward input that closely resembles one of these examples. While this usually guides the model appropriately, it can also reduce its flexibility. Keep this in mind when choosing your examples.
Grading results
During development, and also after release (if you're using your prompt in a production system), you'll need to evaluate your results. Which methods you use depends on your usage scale and where you are in the development process. For grading outputs with an absolute answer (such as classification or sentiment analysis), judge on the following criteria:
- Accuracy (how many times the answer was correct)
- For prompts that are sensitive to wrong answers, calculate false positives and false negatives, and the cost of each, when deciding the bar to reach before deciding that your prompt is good enough.
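The accuracy and error-cost calculation above is simple to script. In the sketch below, the labels, predictions, and per-error costs are illustrative assumptions; the point is to weight false positives and false negatives separately before deciding whether the prompt clears your quality bar.

```python
# Sketch: accuracy plus false-positive/false-negative counts for a
# yes/no classification prompt. Labels and costs are illustrative.

expected  = ["yes", "no", "yes", "no", "yes"]
predicted = ["yes", "yes", "yes", "no", "no"]

pairs = list(zip(expected, predicted))
accuracy = sum(e == p for e, p in pairs) / len(pairs)
false_positives = sum(e == "no" and p == "yes" for e, p in pairs)
false_negatives = sum(e == "yes" and p == "no" for e, p in pairs)

# Weight each error type by its (hypothetical) business cost.
cost = false_positives * 1.0 + false_negatives * 5.0
print(accuracy, false_positives, false_negatives, cost)
```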
Human evaluation
If you have the resources, human evaluation is often the best solution. Typically, in the early stages of adjusting your prompts, you'll be evaluating the answers yourself. Create a list of criteria to consider when evaluating each answer (accuracy, clarity, usefulness). You can either rate each criterion individually or give a general overall score (1-5, good/OK/bad).
Use an LLM to evaluate your results
You can try using another LLM to rank your answers. If you do, use a different LLM than the one that generated the answer (LLMs, like people, tend to be biased toward their own answers). For example, if you have a large block of information and your prompt asks the model to answer a specific question based on that information, you might write an evaluation prompt that supplies the same information, the question, and the generated answer, and asks the second model to rate how accurate and complete the answer is.
Maintaining output quality in production
Log your prompts and responses in production. Periodically check the output quality for a random selection of your generated answers. Provide a feedback button that sends you the question and generated response to help you improve your prompts.
Break up complex requests into multiple prompts
If you have a complex task that requires several steps, you might want to break it into multiple prompts, and port the output of each step into the prompt for the next step. That way you can fine-tune the results (and temperature) for each step. For example, to provide a list of doctors for a patient with a specific symptom, you might break it into these steps:
- Given the patient's symptoms, determine what the likely problem is.
- Given the problem, decide which type of doctor handles that type of problem
- Look up the list of doctors with that specialty and provide contact information for each.
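The three steps above can be sketched as a chain, where each step's output is ported into the next prompt. In this sketch, run_llm is a stand-in for your actual model call (it returns canned answers so the control flow is runnable), and the symptoms, specialties, and doctor directory are all made up for illustration.

```python
# Sketch of prompt chaining: each step's output feeds the next prompt.
# run_llm is a stand-in that returns canned answers; all data is made up.

def run_llm(prompt, temperature=0.0):
    if "likely problem" in prompt:
        return "seasonal allergies"
    if "type of doctor" in prompt:
        return "allergist"
    return ""

symptoms = "sneezing and itchy eyes every spring"

# Step 1: symptoms -> likely problem.
problem = run_llm(f"Given these symptoms: {symptoms}. What is the likely problem?")

# Step 2: problem -> specialty (could use a different temperature).
specialty = run_llm(f"Which type of doctor handles this problem: {problem}? "
                    "Answer with the type of doctor.")

# Step 3: specialty -> contact list. A plain database lookup, not an LLM call.
doctors = {"allergist": ["Dr. Reyes (555-0100)", "Dr. Okafor (555-0199)"]}
print(doctors.get(specialty, []))
```

Note that the final step is deliberately not an LLM call: looking up contact information is exactly the kind of exact-retrieval task where a database is the more appropriate tool.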