Build conversational experiences that interact with your organizational documents

The Conversational RAG endpoint enables you to build conversational experiences that interact with your organizational data. Using a chat interface allows your users to refine or ask follow-up questions, with context retained between questions. The solution leverages AI21's RAG Engine, which ensures that answers are based solely on information from your documents.

This system is designed to effortlessly extract the right information from your organizational data. You don’t need to invest any time in prompt engineering or crafting detailed system messages - simply ask your question (and as many follow-ups as you’d like) and get grounded answers.

Try it out

Try it out in the playground in your browser. You'll need to upload some documents first.

Learn more about the RAG Engine.

To understand how to implement single- and multi-turn chat, read the Jamba Instruct documentation.

Request details

Endpoint: POST v1/conversational-rag

Request Parameters

Header parameters

Authorization: [required] Bearer token authorization, required for all requests. Use your API key. Example: Authorization: Bearer asdfASDF5433
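
As a minimal sketch of the endpoint and header described above, the following helpers assemble the request URL and the Authorization header. The root URL is the one used in the example at the end of this page; the helper names endpoint_url and auth_headers are hypothetical and exist only for illustration.

```python
# Root URL taken from the example at the end of this page.
ROOT_URL = "https://api.ai21.com/studio/v1/"


def endpoint_url() -> str:
    """Return the full URL to POST conversational RAG requests to."""
    return ROOT_URL + "conversational-rag"


def auth_headers(api_key: str) -> dict[str, str]:
    """Build the bearer-token header that every request must carry."""
    if not api_key:
        raise ValueError("an API key is required")
    return {"Authorization": f"Bearer {api_key}"}
```

For example, auth_headers("asdfASDF5433") returns a dictionary whose Authorization value is "Bearer asdfASDF5433", matching the header example above.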

Body parameters

  • messages: [array of ChatMessage, required] The message history of this chat, from oldest (index 0) to newest. Messages must be alternating user/assistant messages, starting with a user message. Maximum total size for the array is about 256K tokens. Each message includes the following fields:
    • role: [string, required] The role of the message author. One of the following values:
      • user: This message is a user question.
      • assistant: This message is a response generated by the model.
    • content: [string, required] The content of the message.
  • path: [string, optional] If specified, only looks in documents with a path metadata value that matches this path or path prefix. For example, specifying "/pets/" would match documents with the path "/pets/" or "/pets/dogs/". Use this to focus the question on a specific path in your library. See filtering documents.
  • labels: [array of strings, optional] Specify labels to restrict answers to documents with any of these labels. Labels are exact, case-sensitive matches; substrings do not match. See filtering documents.
  • file_ids: [array of strings, optional] Specify which files should be included in the results. See filtering documents.
  • max_segments: [integer, optional] The maximum total number of document segments to use in formulating the answer. More segments means greater potential accuracy at the cost of speed and more tokens. If not specified, the system uses an optimal default.
  • retrieval_similarity_threshold: [float, optional] How "similar" a source segment should be to the query in order to be added to the context used to answer the question. Similarity is judged by the RAG engine's embedding values of the question and the source. If not specified, the system uses an optimal default. Range: 0.5 – 1.5
  • retrieval_strategy: [enum, optional] Determines the scope of text segments added to the context during retrieval. One of the following values:
    • "segments" [default]: Retrieve segments within the retrieval_similarity_threshold.
    • "add_neighbors": Retrieve segments within the retrieval_similarity_threshold, plus neighboring segments. This helps provide more context to the model in order to provide a better answer, though the neighbors may be unrelated to the user query. Requires more memory than "segments" and may slow results. If specified, you can also set the max_neighbors value.
    • "full_doc": Use the entire document if it contains any information pertinent to the query. Highest potential accuracy, although extra information might move the answer more off topic than "add_neighbors", and also might result in the slowest response speed.
  • max_neighbors: [integer, optional] Used only when retrieval_strategy = add_neighbors. Specifies how many neighbor segments to combine with each candidate segment when generating the context during the retrieval step. Neighbors may cover a different topic than the candidate segment, but including them can give the LLM more context and potentially produce a more coherent answer. The actual number of neighbors added might be less if the segment is close to the beginning or end of the document.
  • hybrid_search_alpha: [float, optional] Defines the ratio between dense and sparse retrieval values used when evaluating segments in the library for eligibility for the context. Dense values are the embedding values, a conceptual or topical representation of the segment, as represented by a large vector. Sparse values are more like a keyword search within the segment. 1.0 means using only dense embeddings; 0.0 means using only sparse embeddings. If you want to limit your sources to those that use specific terms, and your answers seem too broad, you might lower this value slightly. If not specified, the system uses an optimal default. Range: 0.0 – 1.0
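
The constraints in the parameter list above can be checked on the client before a request is sent. The following is a minimal sketch, not part of any AI21 library: build_request_body is a hypothetical helper name, and two of its checks are assumptions rather than documented behavior (rejecting an empty messages list, and raising an error when max_neighbors is set without "add_neighbors").

```python
from typing import Any

ALLOWED_STRATEGIES = {"segments", "add_neighbors", "full_doc"}


def build_request_body(messages: list[dict[str, str]], **options: Any) -> dict[str, Any]:
    """Check the documented constraints and return a JSON-ready request body."""
    if not messages:
        # Assumption: an empty history is treated as an error here.
        raise ValueError("messages must contain at least one user message")
    for index, message in enumerate(messages):
        # Messages alternate user/assistant, starting with a user message.
        expected_role = "user" if index % 2 == 0 else "assistant"
        if message.get("role") != expected_role:
            raise ValueError(f"messages[{index}] should have role {expected_role!r}")
        if not isinstance(message.get("content"), str):
            raise ValueError(f"messages[{index}] needs string content")

    strategy = options.get("retrieval_strategy")
    if strategy is not None and strategy not in ALLOWED_STRATEGIES:
        raise ValueError(f"unknown retrieval_strategy: {strategy!r}")
    if options.get("max_neighbors") is not None and strategy != "add_neighbors":
        # Assumption: fail early rather than send a value that is not used.
        raise ValueError("max_neighbors is used only with retrieval_strategy='add_neighbors'")

    threshold = options.get("retrieval_similarity_threshold")
    if threshold is not None and not 0.5 <= threshold <= 1.5:
        raise ValueError("retrieval_similarity_threshold must be in the range 0.5 to 1.5")
    alpha = options.get("hybrid_search_alpha")
    if alpha is not None and not 0.0 <= alpha <= 1.0:
        raise ValueError("hybrid_search_alpha must be in the range 0.0 to 1.0")

    # Leave unspecified options out so the service applies its own defaults.
    body: dict[str, Any] = {"messages": messages}
    body.update({key: value for key, value in options.items() if value is not None})
    return body
```

Options passed as None are dropped from the body, so the optional parameters fall back to the system defaults described above.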

Filtering documents

You can filter the pool of potential documents by document ID, label, or path. Note that these are intersection filters -- that is, if you specify both a label value and a path value, only documents with both the label and the path will be matched. The only variation is the labels parameter, where any label in the list can be matched.

Filters | Matching docs
labels=["red", "green", "blue"] | matches label "red" OR "green" OR "blue"
labels=["red", "green", "blue"] AND path="/colors/" | matches (label "red" OR "green" OR "blue") AND path="/colors/any/other/suffix"
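
The filter semantics above can be modeled locally. This is only an illustrative sketch of the matching rules as described on this page, not the RAG Engine's implementation: matches_filters is a hypothetical function, and the document record shape (file_id, labels, and path keys) is an assumption made for the example.

```python
from typing import Optional, Sequence


def matches_filters(
    doc: dict,
    labels: Optional[Sequence[str]] = None,
    path: Optional[str] = None,
    file_ids: Optional[Sequence[str]] = None,
) -> bool:
    """Return True if a document record passes every filter that was supplied."""
    # labels: any one label may match; comparison is exact and case-sensitive.
    if labels and not set(labels) & set(doc.get("labels", [])):
        return False
    # path: the document path must equal the filter or start with it.
    if path and not doc.get("path", "").startswith(path):
        return False
    # file_ids: the document must be one of the listed files.
    if file_ids and doc.get("file_id") not in file_ids:
        return False
    return True
```

A document labeled "red" at path "/colors/any/other/suffix" passes both rows of the table, while the same document fails a labels=["Red"] filter because label matching is case-sensitive.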

Response details

A successful response includes the following fields:

  • answer_in_context: [boolean] True if an answer was found in the provided documents, False if an answer could not be found. It can be simpler to check this value rather than to look at the response text and evaluate if it includes an answer.
  • context_retrieved: [boolean] True if the RAG engine was able to find segments related to the user's query.
  • choices: [array of objects] An array with one object that holds the generated response.
    • message: [object] Contains the following fields:
      • role: [string] Always assistant
      • content: [string] The generated answer. If an answer cannot be found, it will say so in natural language.
  • id: [string] A unique ID for the request (not the message). Repeated identical requests get different IDs. However, for a streaming response, the ID will be the same for all responses in the stream.
  • search_queries: [array of strings or null] The questions that the model extracted from the user input thread. The model extracts the question from the most recent user message, taking into account the entire message history. If there isn’t a relevant query in the message, this will be null (None in Python) and nothing will be retrieved.
  • sources: An array of objects, where each object represents a segment used to generate the answer. Each source object contains the following fields:
    • file_id: The ID of the file in the RAG library that contains this segment.
    • file_name: The name of the source document in the RAG library that contains this segment.
    • text: The full text of the retrieved segment.
    • score: The similarity score between this segment and the user's query.
    • public_url: A URL pointing to the source document (if available). This is the publicUrl metadata provided (if any) by the caller when they uploaded the file to the library.
  • usage: [object] The token counts for this request. Per-token billing is based on the prompt token and completion token counts and rates.
    • prompt_tokens: [integer] Number of tokens in the messages list in the request. Note that this includes all content of the message history plus some extra tokens needed by the system when combining the list of prompt messages into a single message. The number of extra tokens is relatively small compared to the tokens in the message history.
    • completion_tokens: [integer] Number of tokens in the response message.
    • total_tokens: [integer] prompt_tokens + completion_tokens
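
To show how the response fields fit together, here is a small sketch that condenses a parsed response into the values an application typically needs. The function name summarize_response is hypothetical, and sample_response is hand-written for illustration with made-up values; it is not real API output.

```python
from typing import Any


def summarize_response(response: dict[str, Any]) -> dict[str, Any]:
    """Pull the answer, grounding flag, queries, and source files from a response."""
    choices = response.get("choices") or []
    # The answer text sits in the message object of the single choice.
    answer = choices[0]["message"]["content"] if choices else ""
    sources = response.get("sources") or []
    usage = response.get("usage") or {}
    return {
        "answer": answer,
        # Checking the flag is simpler than inspecting the answer text.
        "grounded": bool(response.get("answer_in_context")),
        # search_queries is null when no query could be extracted.
        "queries": response.get("search_queries") or [],
        "source_files": sorted({source["file_name"] for source in sources}),
        "total_tokens": usage.get("total_tokens"),
    }


# Hand-written sample with made-up values, shaped like the fields listed above.
sample_response = {
    "id": "example-id",
    "answer_in_context": True,
    "context_retrieved": True,
    "choices": [{"message": {"role": "assistant", "content": "Madrid is the capital of Spain."}}],
    "search_queries": ["What is the capital of Spain?"],
    "sources": [
        {"file_id": "file-1", "file_name": "spain_guide.pdf", "text": "Madrid is the capital.", "score": 0.9, "public_url": None},
    ],
    "usage": {"prompt_tokens": 120, "completion_tokens": 8, "total_tokens": 128},
}

summary = summarize_response(sample_response)
print(summary["answer"], summary["source_files"])
```

Treating a null search_queries as an empty list keeps downstream code from special-casing the "no query found" response.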

Example

# Raw REST, not using the SDK
import os
import requests

ROOT_URL = "https://api.ai21.com/studio/v1/"
# Assumption: your API key is stored in the AI21_API_KEY environment variable
AI21_API_KEY = os.environ["AI21_API_KEY"]

# RAG engine answers about Spain from your library
def chat_with_library():
  url = ROOT_URL + "conversational-rag"
  data = {
    "messages": [],
    "labels": ["spain"]
  }
  QUIT_STRING = "//"
  quit_message = f"(Reply {QUIT_STRING} to quit)"

  user_message = input(f"What would you like to know about Spain? {quit_message} ")
  while user_message != QUIT_STRING:
    data["messages"].append({"role": "user", "content": user_message})
    res = requests.post(
      url=url,
      headers={"Authorization": f"Bearer {AI21_API_KEY}"},
      json=data
    )
    res.raise_for_status()  # Stop on HTTP errors instead of parsing an error body
    body = res.json()
    print("Response JSON: ", body)

    if body["search_queries"] is None:
      print("Couldn't parse a query")
    else:
      print("What we think you asked: ", body["search_queries"])

    # The answer text is in the message object of the first (and only) choice
    answer = body["choices"][0]["message"]["content"]
    data["messages"].append({"role": "assistant", "content": answer})
    print(answer)

    # We could parse the question but not find enough library material for an answer
    if not body["context_retrieved"] and body["search_queries"] is not None:
      print("You need more books in your library! ")

    user_message = input(f"What else would you like to know about Spain? {quit_message} ")

if __name__ == "__main__":
  chat_with_library()