Jamba-based chat and text completion model

This is the endpoint for the Jamba Instruct model, a foundation model that supports both single-turn (question answering, text completion) and multi-turn (chat-style) interactions. The main difference between the two is that a single-turn interaction calls the endpoint once with the prompt, while a chat-style interaction calls the endpoint again with each user response, sending the entire chat history with each call.

You can optionally stream results if you want to get a response as each token is generated, rather than waiting for the complete response all at once.

Multi-turn usage (chat)

Use this endpoint to implement a chatbot based on the Jamba Instruct model.

The model does not store state between calls, so you must provide the full message history with each request to guide the model. The first message is an optional system message that sets the tone, voice, and context of the discussion. After that, messages must alternate between user and assistant messages (representing user input and model-generated responses) from oldest to newest. Here's an example message history with a hardware store assistant after two responses from the user:

[
  {"role": "system", "content": "You are a helpful hardware store assistant."},
  {"role": "user", "content": "I'd like to buy some #6 1-3/4 decking screws please."},
  {"role": "assistant", "content": "Sure! What screwhead type? We have Phillips and Torx."},
  {"role": "user", "content": "Torx"},
  {"role": "assistant", "content": "How many would you like? You can buy them by weight, or we have boxes of 100 or 500 at a discount."}
]

It is your responsibility to save the response history from each request if you want to make another request. If you omit items from the list, the model will not have them as context for its next response.

(For very long conversations, you might choose to truncate earlier messages to save cost for context that might not matter as much. Third party tools such as LangChain offer various helper tools to manage context length by summarizing or truncating older messages.)
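Third-party tools aside, a minimal truncation helper might look like this (the pair-dropping rule and the `max_messages` budget here are illustrative assumptions, not part of the API):

```python
def truncate_history(messages, max_messages=10):
    """Keep the optional system message plus only the most recent turns.

    Drops the oldest user/assistant messages in pairs, so the kept turns
    still alternate and still start with a user message.
    """
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    excess = len(turns) - max_messages
    if excess > 0:
        excess += excess % 2  # round up to a whole user/assistant pair
        turns = turns[excess:]
    return system + turns
```

Note that summarizing dropped turns (rather than discarding them) preserves more context at the cost of an extra model call.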

You can request multiple answers to the latest user input by setting n greater than 1 in the request, but you should add only one of them to the message history.

Examples

from ai21 import AI21Client
from ai21.models.chat import ChatMessage

client = AI21Client() # Requires that os.environ["AI21_API_KEY"] is set to your AI21 API key

def test_chat(history: list[str], system_message: str | None = None):
    roles = ["user", "assistant"]  # history alternates user/assistant, oldest first
    history_list = [ChatMessage(role=roles[i % 2], content=history[i]) for i in range(len(history))]
    if system_message:
        history_list.insert(0, ChatMessage(role="system", content=system_message))

    response = client.chat.completions.create(
        model="jamba-instruct",
        messages=history_list,
        temperature=1.3, # A higher temperature makes the genie a little hipper
        max_tokens=200
    )
    return response.choices[0].message.content

conversation_history = [
    "I want a new car",
    "Great choice, I can definitely help you with that! Before I grant your wish, can you tell me what kind of car you're looking for?",
    "A corvette"
]
system_message = """
  You are a helpful and hip genie just released from a bottle.
  You start the conversation with 'Right on! I grant you one wish.'"""
print(test_chat(conversation_history, system_message=system_message))

The same example using the REST API directly:

import requests
import os

AI21_API_KEY = os.environ["AI21_API_KEY"] # Your AI21 API key
ROOT_URL = "https://api.ai21.com/studio/v1/"

def test_chat(history: list[str], system_message: str | None = None):
    roles = ["user", "assistant"]  # history alternates user/assistant, oldest first
    history_list = [{"role": roles[i % 2], "content": history[i]} for i in range(len(history))]
    if system_message:
        history_list.insert(0, {"role": "system", "content": system_message})

    url = ROOT_URL + "chat/completions"
    response = requests.post(
      url,
      headers={"Authorization": f"Bearer {AI21_API_KEY}"},
      json={
        "model": "jamba-instruct", 
        "messages": history_list,
        "max_tokens": 200,
        "temperature": 1.3 # A higher temperature makes the genie a little hipper
      }
    )
    return response.json()["choices"][0]["message"]["content"]

conversation_history = [
    "I want a new car",
    "Great choice, I can definitely help you with that! Before I grant your wish, can you tell me what kind of car you're looking for?",
    "A corvette"
]
system_message = """
  You are a helpful and hip genie just released from a bottle.
  You start the conversation with 'Right on! I grant you one wish.'"""
print(test_chat(conversation_history, system_message=system_message))

Example output:

You got it! I'ma conjure up a brand new Corvette for you.
Congratulations, your wish has been granted!
Single-turn usage (question answering)

To answer a question, provide examples, or continue a prompt, simply send a single call (with an optional system message to provide context) and a user message to answer or complete.

Examples

from ai21 import AI21Client
from ai21.models.chat import ChatMessage

client = AI21Client() # Requires that os.environ["AI21_API_KEY"] is set to your AI21 API key

def single_message_instruct():
    prompt = "Who was the first emperor of Rome?"
    messages = [
        ChatMessage(
            role="user",
            content=prompt
        )
    ]
    response = client.chat.completions.create(
        model="jamba-instruct",
        messages=messages,
        max_tokens=200, # To keep the response from running off the rails
        temperature=0.7
    )
    # Default is only one response (n=1), so choices should have only one element
    print(response.to_json())

The same example using the REST API directly:

import requests
import os

AI21_API_KEY = os.environ["AI21_API_KEY"] # Your AI21 API key
ROOT_URL = "https://api.ai21.com/studio/v1/"

def completions():
    url = ROOT_URL + "chat/completions"
    res = requests.post(
        url,
        headers={"Authorization": f"Bearer {AI21_API_KEY}"},
        json={
            "model": "jamba-instruct",
            "messages": [
                {"role": "user", "content": "Who was the first emperor of Rome?"}
            ],
            "max_tokens": 200,  # To keep the response from running off the rails
            "temperature": 0.7,
            "stop": None,
        },
    )
    print(res.json())

Example response:

{
    "id": "cmpl-4e8f6ae429494df39080ce4f7386569e",
    "choices": [
        {
            "index": 0, // Can be > 1 when multiple answers requested
            "message": {
                "role": "assistant", // 'assistant' means the model
                "content": "The first emperor of Rome was Augustus Caesar, who ruled from 27 BC to 14 AD. He was the adopted heir of Julius Caesar and is known for transforming Rome from a republic into an empire, ushering in a period of relative peace and stability known as the Pax Romana."
            },
            "finish_reason": "stop" // Reached a natural answer length.
        }
    ],
    "usage": {
        "prompt_tokens": 77,
        "completion_tokens": 65,
        "total_tokens": 142
    }
}

Request details

Endpoint: POST v1/chat/completions

Request Parameters

Header parameters

Authorization: [required] Bearer token authorization is required for all requests. Use your API key. Example: Authorization: Bearer asdfASDF5433

Body parameters

  • model: [string, required] The name of the model to use. Choose one of the following values:
    • jamba-instruct
  • messages: [list[object], required] The previous messages in this chat, from oldest (index 0) to newest. Messages must be alternating user/assistant messages, optionally starting with a system message. For single turn interactions, this should be an optional system message, and a single user message. Maximum total size for the list is about 256K tokens. Each message includes the following members:
    • role: [string, required] The role of the message author. One of the following values:
      • user: Input provided by the user. Any instructions given here that conflict with instructions given in the system prompt take precedence over the system prompt instructions.
      • assistant: Response generated by the model.
      • system: Initial instructions provided to the system to provide general guidance on the tone and voice of the generated message. An initial system message is optional but recommended to provide guidance on the tone of the chat. For example, "You are a helpful chatbot with a background in earth sciences and a charming French accent."
    • content: [string, required] The content of the message.
  • max_tokens: [integer, optional] The maximum number of tokens to allow for each generated response message. Typically the best way to limit output length is by providing a length limit in the system prompt (for example, "limit your answers to three sentences"). Default: 4096, Range: 0 – 4096
  • temperature: [float, optional] How much variation to provide in each answer. Setting this value to 0 guarantees the same response to the same question every time. Setting a higher value encourages more variation. Modifies the distribution from which tokens are sampled. More information Default: 1.0, Range: 0.0 – 2.0
  • top_p: [float, optional] Limit the pool of next tokens in each step to the top N percentile of possible tokens, where 1.0 means the pool of all possible tokens, and 0.01 means the pool of only the most likely next tokens. More information Default: 1.0, Range: 0 <= value <=1.0
  • stop: [string | list[string], optional] End the message when the model generates one of these strings. The stop sequence is not included in the generated message. Each sequence can be up to 64K long, and can contain newlines as \n characters. Examples:
    • Single stop string with a word and a period: "monkeys."
    • Multiple stop strings and a newline: ["cat", "dog", " .", "####", "\n"]
  • n: [integer, optional] How many chat responses to generate. Default:1, Range: 1 – 16 Notes:
    • If n > 1, setting temperature=0 will fail because all answers are guaranteed to be duplicates.
    • n must be 1 when stream = True
  • stream: [boolean, optional] Whether or not to stream the result one token at a time using server-sent events. This can be useful for long results, where a long wait for the complete answer can be a problem, such as in a chatbot. If set to true, then n must be 1. A streaming response is structured differently from a non-streaming response.
  • logprobs: Legacy, ignore
  • top_logprobs: Legacy, ignore
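To tie the optional parameters together, here is a sketch of a request body that uses several of them at once (the values are illustrative only; the POST call itself follows the REST examples above):

```python
import json

body = {
    "model": "jamba-instruct",
    "messages": [
        {"role": "system", "content": "Answer in one short paragraph."},
        {"role": "user", "content": "Name three uses for a #6 decking screw."},
    ],
    "max_tokens": 150,         # hard cap on the response length
    "temperature": 0.8,        # some variation between runs
    "top_p": 0.95,             # sample only from the top 95% of the distribution
    "stop": ["####", "\n\n"],  # end generation at either sequence
    "n": 2,                    # two candidates; requires temperature > 0
}
print(json.dumps(body, indent=2))
```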

Response details

Non-streaming results

A successful non-streamed response includes the following members:

  • id: [string] A unique ID for the request (not the message). Repeated identical requests get different IDs. However, for a streaming response, the ID will be the same for all responses in the stream.
  • choices: [list[object]] One or more responses, depending on the n parameter from the request. Each response includes the following members:
    • index: [integer] Zero-based index of the message in the list of messages. Note that this might not correspond with the position in the response list.
    • message: [object] The message generated by the model. Same structure as the request message, with role and content members.
    • finish_reason: [string] Why the message ended. Possible reasons:
      • stop: The response ended naturally as a complete answer (due to end-of-sequence token) or because the model generated a stop sequence provided in the request.
      • length: The response ended by reaching max_tokens.
  • usage: [object] The token counts for this request. Per-token billing is based on the prompt token and completion token counts and rates.
    • prompt_tokens: [integer] Number of tokens in the prompt for this request. Note that the prompt token includes the entire message history, plus extra tokens needed by the system when combining the list of prompt messages into a single message, as required by the model. The number of extra tokens is typically proportional to the number of messages in the thread, and should be relatively small.
    • completion_tokens: [integer] Number of tokens in the response message.
    • total_tokens: [integer] prompt_tokens + completion_tokens.

Streamed results

When you set stream=true in the request, you will get a sequence of messages, each with one token generated by the model. Read more about streaming calls using the SDK. The last message will be data: [DONE]. The other messages will have data set to a JSON object with the following fields:

  • data: [object] An object containing either an object with the following members, or the string "DONE" for the last message.
    • id: [string] A unique ID for the request (not the message). Repeated identical requests get different IDs. However, for a streaming response, the ID will be the same for all responses in the stream.
    • choices: [list[object]] An array with one object containing the following fields:
      • index: [integer] Always zero.
      • delta [object]
        • The first message in the stream will be an object set to {"role":"assistant"}.
        • Subsequent messages will have an object {"content": __token__} with the generated token.
      • finish_reason: [string] One of the following values:
        • null: All messages but the last will return null for finish_reason.
        • stop: The response ended naturally as a complete answer (due to end-of-sequence token) or because the model generated a stop sequence provided in the request.
        • length: The response ended by reaching max_tokens.
    • usage: [object] The last message includes this field, which shows the total token counts for the request. Per-token billing is based on the prompt token and completion token counts and rates.
      • prompt_tokens: [integer] Number of tokens in the prompt for this request. Note that the token count includes extra tokens added by the system to format the input message list into the single string prompt required by the model. The number of extra tokens is typically proportional to the number of messages in the thread, and should be relatively small.
      • completion_tokens: [integer] Number of tokens in the response message.
      • total_tokens: [integer] prompt_tokens + completion_tokens.

Streaming example

Here is a streaming request and its (trimmed) response:

from ai21 import AI21Client
from ai21.models.chat import ChatMessage

client = AI21Client() # Requires that os.environ["AI21_API_KEY"] is set to your AI21 API key

def test_streaming():
    messages = [ChatMessage(content="Who was the first emperor of Rome?", role="user")]
    response = client.chat.completions.create(
        messages=messages,
        model="jamba-instruct",
        stream=True
    )
    for chunk in response:
        print(chunk.choices[0].delta.content, end="")

The raw server-sent events:

data: {"id": "cmpl-8e8b2f6556f94714b0cd5cfe3eeb45fc", "choices": [{"index": 0, "delta": {"role": "assistant"}, "logprobs": null, "finish_reason": null}]}
data: {"id": "cmpl-8e8b2f6556f94714b0cd5cfe3eeb45fc", "choices": [{"index": 0, "delta": {"content": ""}, "logprobs": null, "finish_reason": null}]}
data: {"id": "cmpl-8e8b2f6556f94714b0cd5cfe3eeb45fc", "choices": [{"index": 0, "delta": {"content": " The"}, "logprobs": null, "finish_reason": null}]}
data: {"id": "cmpl-8e8b2f6556f94714b0cd5cfe3eeb45fc", "choices": [{"index": 0, "delta": {"content": " first e"}, "logprobs": null, "finish_reason": null}]}
data: {"id": "cmpl-8e8b2f6556f94714b0cd5cfe3eeb45fc", "choices": [{"index": 0, "delta": {"content": "mpe"}, "logprobs": null, "finish_reason": null}]}
... 115 responses omitted for sanity ...
data: {"id": "cmpl-8e8b2f6556f94714b0cd5cfe3eeb45fc", "choices": [{"index": 0, "delta": {"content": "me"}, "logprobs": null, "finish_reason": null}]}
data: {"id": "cmpl-8e8b2f6556f94714b0cd5cfe3eeb45fc", "choices": [{"index": 0, "delta": {"content": "."}, "logprobs": null, "finish_reason": "stop"}], "usage": {"prompt_tokens": 107, "completion_tokens": 121, "total_tokens": 228}}
data: [DONE]
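The SDK handles this framing for you, but if you call the REST endpoint directly (for example with requests.post(..., stream=True) and iter_lines()) you must parse the data: lines yourself. A minimal sketch, assuming lines arrive already decoded to strings:

```python
import json

def iter_stream_events(lines):
    """Decode the `data:` lines of the stream shown above.

    Yields each JSON payload and stops at the `data: [DONE]` sentinel.
    """
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)

def collect_content(lines):
    """Reassemble the generated message from the delta chunks."""
    parts = []
    for event in iter_stream_events(lines):
        delta = event["choices"][0]["delta"]
        parts.append(delta.get("content", ""))  # first chunk carries only a role
    return "".join(parts)
```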