Conversational RAG

Build conversational experiences that interact with your organizational documents

Overview

The Conversational RAG endpoint enables you to build conversational experiences that interact with your organizational data. Using a chat interface allows your users to refine or ask follow-up questions, with context retained between questions. The solution leverages AI21's RAG Engine, which ensures that answers are based solely on information from your documents.

This system is designed to effortlessly extract the right information from your organizational data.
Just ask your question (and any follow-ups), and get clear, accurate answers—no need for prompt engineering or detailed system messages.

Authentication

The Authorization header is required for all API requests. It must include a Bearer token for authentication.
Use your API key as the token.

Format:
Authorization: Bearer <your-api-key>
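
As a minimal sketch, the header can be built in Python as shown here. The build_auth_headers helper and the AI21_API_KEY environment variable name are illustrative assumptions, not part of the API.

```python
import os


def build_auth_headers(api_key: str) -> dict:
    """Return the Authorization header in the Bearer format shown above."""
    if not api_key:
        raise ValueError("An API key is required")
    return {"Authorization": f"Bearer {api_key}"}


# Assumption: the key is stored in an environment variable named AI21_API_KEY.
# The placeholder fallback only keeps this sketch runnable without a real key.
headers = build_auth_headers(os.environ.get("AI21_API_KEY", "<your-api-key>"))
```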


Request details

Endpoint: POST v1/conversational-rag

Request body


messages array of ChatMessage Required
The message history of this chat, from oldest (index 0) to newest. Messages must be alternating user/assistant messages, starting with a user message. Maximum total size for the array is about 256K tokens. Each message includes the following fields.


role string Required
The role of the message author. One of the following values:

user: This message is a user question.


assistant: This message is a response generated by the model.


content string Required
The content of the message.
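
The ordering rules for messages can be checked on the client before sending a request. The sketch below is one way to do that; find_message_problems is a hypothetical helper, it does not check the token limit, and treating an empty history as an error is an assumption.

```python
def find_message_problems(messages: list) -> list:
    """Return a list of human-readable problems; an empty list means the history looks valid."""
    problems = []
    if not messages:
        # Assumption: at least one user message is needed to ask a question.
        problems.append("messages is empty")
    for index, message in enumerate(messages):
        # Even positions (0, 2, ...) must be user messages, odd positions assistant messages.
        expected_role = "user" if index % 2 == 0 else "assistant"
        if message.get("role") != expected_role:
            problems.append(f"message {index}: expected role '{expected_role}'")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {index}: content must be a string")
    return problems


history = [
    {"role": "user", "content": "What is the capital of Spain?"},
    {"role": "assistant", "content": "Madrid."},
    {"role": "user", "content": "How many people live there?"},
]
problems = find_message_problems(history)
```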


path string

If specified, only looks in documents whose path metadata matches this path or path prefix. That is, specifying "/pets/" would match documents with the path "/pets/" or "/pets/dogs/". Use this to focus the question on a specific path in your library. See filtering documents.


labels array of strings
Specify labels to restrict answers to documents with any of these labels. Labels are exact, case-sensitive matches, no substring matches. See filtering documents.


file_ids array of strings
Specify which files should be included in the results. See filtering documents.


max_segments integer
The maximum total number of document segments to use in formulating the answer. More segments means greater potential accuracy at the cost of speed and more tokens. If not specified, the system uses an optimal default.


retrieval_similarity_threshold float
Range: 0.5 – 1.5. How "similar" a source segment should be to the query in order to be added to the context used to answer the question. Similarity is judged by the RAG engine's embedding values of the question and the source. If not specified, the system uses an optimal default.


retrieval_strategy enum
Determines the scope of text segments added to the context during retrieval.


segments [default]: Retrieve segments within the retrieval_similarity_threshold.

add_neighbors: Retrieve segments within the retrieval_similarity_threshold, plus neighboring segments. This gives the model more context for a better answer, though the neighbors may be unrelated to the user query. Requires more memory than "segments" and may slow results. If specified, you can also set the max_neighbors value.

full_doc: Use the entire document if it contains any information pertinent to the query. Highest potential accuracy, although extra information might move the answer more off topic than "add_neighbors", and also might result in the slowest response speed.


max_neighbors integer
Used only when retrieval_strategy = add_neighbors. Specifies how many neighbor segments to combine with each candidate segment when generating the context during the retrieval step. Neighbors might cover a different topic than the candidate segment, but including them can give the LLM more context and potentially produce a more coherent answer. The actual number of neighbors added might be less if the segment is close to the beginning or end of the document.
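
The boundary behavior described for max_neighbors can be illustrated with a small sketch. It assumes neighbors are taken symmetrically, up to max_neighbors on each side of the candidate; the engine's actual selection rule is not documented here, and neighbor_window is a hypothetical helper.

```python
def neighbor_window(candidate_index: int, segment_count: int, max_neighbors: int) -> list:
    """Return the indices of the candidate segment plus its neighbors.

    Assumption: up to max_neighbors segments are taken on each side.
    The window is clipped at the start and end of the document.
    """
    if not 0 <= candidate_index < segment_count:
        raise ValueError("candidate_index is outside the document")
    start = max(0, candidate_index - max_neighbors)
    end = min(segment_count - 1, candidate_index + max_neighbors)
    return list(range(start, end + 1))


middle = neighbor_window(candidate_index=5, segment_count=10, max_neighbors=2)
first = neighbor_window(candidate_index=0, segment_count=10, max_neighbors=2)
```

A candidate in the middle of the document gets its full set of neighbors, while a candidate at index 0 only gets the segments that follow it.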


hybrid_search_alpha float
Range: 0.0 – 1.0. Defines the ratio between dense and sparse retrieval values used when evaluating segments in the library for eligibility for the context. Dense values are the embedding values: a conceptual or topical representation of the segment, expressed as a large vector. Sparse values are more like a keyword search within the segment. 1.0 means using only dense embeddings; 0.0 means using only sparse embeddings. If you want to limit your sources to those that use specific terms, and your answers seem too broad, you might lower this value slightly. If not specified, the system uses an optimal default.
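
The optional parameters above can be assembled and range-checked on the client before the request is sent. This is a minimal sketch: build_request_body is a hypothetical helper, the parameter names are assumed to be retrieval_similarity_threshold and hybrid_search_alpha, and the checks only mirror the ranges and values stated in this section.

```python
VALID_STRATEGIES = {"segments", "add_neighbors", "full_doc"}


def build_request_body(messages, path=None, labels=None, file_ids=None,
                       max_segments=None, retrieval_similarity_threshold=None,
                       retrieval_strategy=None, max_neighbors=None,
                       hybrid_search_alpha=None) -> dict:
    """Build the JSON body, leaving out unset options so server defaults apply."""
    threshold = retrieval_similarity_threshold
    if threshold is not None and not 0.5 <= threshold <= 1.5:
        raise ValueError("retrieval_similarity_threshold must be between 0.5 and 1.5")
    if hybrid_search_alpha is not None and not 0.0 <= hybrid_search_alpha <= 1.0:
        raise ValueError("hybrid_search_alpha must be between 0.0 and 1.0")
    if retrieval_strategy is not None and retrieval_strategy not in VALID_STRATEGIES:
        raise ValueError(f"retrieval_strategy must be one of {sorted(VALID_STRATEGIES)}")
    if max_neighbors is not None and retrieval_strategy != "add_neighbors":
        raise ValueError("max_neighbors is used only with retrieval_strategy='add_neighbors'")

    optional = {
        "path": path,
        "labels": labels,
        "file_ids": file_ids,
        "max_segments": max_segments,
        "retrieval_similarity_threshold": threshold,
        "retrieval_strategy": retrieval_strategy,
        "max_neighbors": max_neighbors,
        "hybrid_search_alpha": hybrid_search_alpha,
    }
    body = {"messages": messages}
    body.update({key: value for key, value in optional.items() if value is not None})
    return body


body = build_request_body(
    [{"role": "user", "content": "Which regions of Spain produce olive oil?"}],
    labels=["spain"],
    retrieval_strategy="add_neighbors",
    max_neighbors=1,
    hybrid_search_alpha=0.8,
)
```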

Filtering documents
You can filter the pool of potential documents by document ID, label, or path. Note that these are intersection filters -- that is, if you specify both a label value and a path value, only documents with both the label and the path will be matched. The only variation is the labels parameter, where any label in the list can be matched.

Filters | Matching docs
labels=["red", "green", "blue"] | matches label "red" OR "green" OR "blue"
labels=["red", "green", "blue"] AND path="/colors/" | matches (label "red" OR "green" OR "blue") AND path="/colors/any/other/suffix"
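
The filter semantics can be modeled locally to reason about which documents a request would consider. The sketch below is only an approximation of the rules described above, not the engine's implementation: document_matches is a hypothetical helper, the document dictionary keys are invented for illustration, and an empty or missing filter is assumed to mean "no restriction".

```python
def document_matches(doc: dict, path=None, labels=None, file_ids=None) -> bool:
    """Return True if doc passes every filter that was supplied (intersection)."""
    # Path filter: the document path must equal the filter or start with it.
    if path and not doc.get("path", "").startswith(path):
        return False
    # Label filter: any one exact, case-sensitive label match is enough.
    if labels and not set(labels) & set(doc.get("labels", [])):
        return False
    # File ID filter: the document must be one of the listed files.
    if file_ids and doc.get("file_id") not in file_ids:
        return False
    return True


doc = {"file_id": "f1", "path": "/colors/warm/", "labels": ["red"]}
matches_labels = document_matches(doc, labels=["red", "green", "blue"])
matches_both = document_matches(doc, labels=["red", "green", "blue"], path="/colors/")
```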

Response details


A successful response includes the following fields:

answer_in_context boolean
True if an answer was found in the provided documents, False if an answer could not be found.
It can be simpler to check this value rather than to look at the response text and evaluate if it includes an answer.


context_retrieved boolean
True if the RAG engine was able to find segments related to the user's query.


choices array of objects
An array with one object that holds the generated response.


message object
Contains the following fields


role string
Always assistant

content string
The generated answer. If an answer cannot be found, it will say so in natural language.


id string
A unique ID for the request (not the message). Repeated identical requests get different IDs.
However, for a streaming response, the ID will be the same for all responses in the stream.


search_queries array of strings or null
The questions that the model extracted from the user input thread. The model extracts the question from the most recent user message, taking into account the entire message history. If there isn’t a relevant query in the message, this will be null (None after parsing the JSON in Python) and nothing will be retrieved.
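
Taken together, search_queries, context_retrieved, and answer_in_context let a client tell apart the ways a request can fail to produce an answer. A minimal sketch, in which classify_response and its return labels are hypothetical:

```python
def classify_response(response: dict) -> str:
    """Map the response flags to a coarse outcome label."""
    if response.get("search_queries") is None:
        return "no_query"      # No question could be extracted from the messages
    if not response.get("context_retrieved"):
        return "no_context"    # A question was found, but no related segments
    if not response.get("answer_in_context"):
        return "no_answer"     # Segments were found, but they did not contain an answer
    return "answered"


outcome = classify_response({
    "search_queries": ["What is the capital of Spain?"],
    "context_retrieved": True,
    "answer_in_context": True,
})
```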


sources array of objects
Each object represents a segment used to generate the answer.
Each source object contains the following fields.

  • file_id: The ID of the file in the RAG library that contains this segment.
  • file_name: The name of the source document in the RAG library that contains this segment.
  • text: The full text of the retrieved segment.
  • score: The similarity score between this segment and the user's query.
  • public_url: A URL pointing to the source document (if available). This is the publicUrl metadata provided (if any) by the caller when they uploaded the file to the library.
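
Because several segments can come from the same file, it is often convenient to group the sources before showing citations to a user. The following sketch assumes the source fields listed above and uses a hypothetical summarize_sources helper; the sample data is invented.

```python
def summarize_sources(response: dict) -> list:
    """Group retrieved segments by file, in order of first appearance."""
    summary = {}
    for source in response.get("sources") or []:
        entry = summary.setdefault(source["file_id"], {
            "file_name": source.get("file_name"),
            "public_url": source.get("public_url"),
            "segments": 0,
        })
        entry["segments"] += 1  # Count how many segments this file contributed
    return list(summary.values())


sample_response = {
    "sources": [
        {"file_id": "a", "file_name": "spain.pdf", "text": "...", "score": 0.9, "public_url": None},
        {"file_id": "b", "file_name": "madrid.txt", "text": "...", "score": 0.8, "public_url": None},
        {"file_id": "a", "file_name": "spain.pdf", "text": "...", "score": 0.7, "public_url": None},
    ]
}
citations = summarize_sources(sample_response)
```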

usage object
The token counts for this request.
Per-token billing is based on the prompt token and completion token counts and rates.


prompt_tokens integer
Number of tokens in the prompt for this request.
The prompt token contains the entire message history and extra tokens for combining messages, proportional to the number of messages.


completion_tokens integer
Number of tokens in the response message.


total_tokens integer
The sum of prompt_tokens and completion_tokens.
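
Since billing depends on the prompt and completion token counts and their rates, a request's cost can be estimated from the usage object. This is a sketch only: estimate_cost is a hypothetical helper and the rates below are placeholders, not real prices.

```python
def estimate_cost(usage: dict, prompt_rate_per_token: float,
                  completion_rate_per_token: float) -> float:
    """Estimate the cost of one request from its usage object."""
    prompt_cost = usage["prompt_tokens"] * prompt_rate_per_token
    completion_cost = usage["completion_tokens"] * completion_rate_per_token
    return prompt_cost + completion_cost


example_usage = {"prompt_tokens": 1200, "completion_tokens": 300, "total_tokens": 1500}
# Placeholder rates for illustration; look up the actual rates for your plan.
cost = estimate_cost(example_usage, 0.000002, 0.000008)
```

Because prompt_tokens covers the entire message history, the prompt share of the cost grows with every turn of a long conversation.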

Example

# Raw REST, not using the SDK
import os
import requests

ROOT_URL = "https://api.ai21.com/studio/v1/"
# Assumes the API key is stored in the AI21_API_KEY environment variable
AI21_API_KEY = os.environ["AI21_API_KEY"]

# RAG engine answers about Spain from your library
def chat_with_library():
  url = ROOT_URL + "conversational-rag"
  # The message history grows with each turn; only documents labeled "spain" are searched
  data = {
    "messages": [],
    "labels": ["spain"]
  }
  QUIT_STRING = "//"
  quit_message = f"(Reply {QUIT_STRING} to quit)"

  user_message = input(f"What would you like to know about Spain? {quit_message} ")
  while user_message != QUIT_STRING:
    data["messages"].append({"role": "user", "content": user_message})
    res = requests.post(
      url=url,
      headers={"Authorization": f"Bearer {AI21_API_KEY}"},
      json=data
    )
    res.raise_for_status()  # Stop on HTTP errors instead of parsing an error body
    body = res.json()       # Parse the response once and reuse it
    print("Response JSON: ", body)

    if body["search_queries"] is None:
      print("Couldn't parse a query")
    else:
      print("What we think you asked: ", body["search_queries"])

    # Add the answer to the history so follow-up questions keep their context
    answer = body["choices"][0]["content"]
    data["messages"].append({"role": "assistant", "content": answer})
    print(answer)

    # We could parse the question but not find enough library material for an answer
    if not body["context_retrieved"] and body["search_queries"] is not None:
      print("You need more books in your library! ")

    user_message = input(f"What else would you like to know about Spain? {quit_message} ")

if __name__ == "__main__":
  chat_with_library()