Overview
This guide walks you through self-deploying AI21’s Jamba models in your own infrastructure. Choose the deployment method that best fits your needs.
We recommend using vLLM version v0.6.5 to v0.8.5.post1 for optimal performance and compatibility.
Prerequisites
System Requirements
Jamba Mini 1.6
- Model Size: 96.07GB
- GPU Memory Required when Quantized: ~55GB
Jamba Large 1.6
- Model Size: 743GB
- GPU Memory Required when Quantized: ~400GB total
Deployment Options
Option 1: vLLM Direct Usage
Create a Python virtual environment and install the vLLM package (version ≥0.6.5, ≤0.8.5.post1 to ensure maximum compatibility with all Jamba models).
# Create and activate virtual environment
python -m venv vllm-env
source vllm-env/bin/activate
# Install vLLM
pip install "vllm>=0.6.5,<=0.8.5.post1"
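To confirm that the installed version falls within the recommended range, you can check it from Python:
# Verify the installed vLLM version
import vllm
print(vllm.__version__)  # expected to be between 0.6.5 and 0.8.5.post1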
Authenticate on the HuggingFace Hub using your access token $HF_TOKEN:
huggingface-cli login --token $HF_TOKEN
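Alternatively, you can authenticate from Python with the huggingface_hub library; a minimal sketch assuming your access token is exported as HF_TOKEN:
# Programmatic alternative to huggingface-cli login
import os
from huggingface_hub import login

login(token=os.environ["HF_TOKEN"])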
Online Inference (Server Mode)
Launch the vLLM server for API-based inference.
Jamba Mini 1.6
Start the vLLM server:
vllm serve ai21labs/AI21-Jamba-Mini-1.6 \
--quantization="experts_int8" \
--tensor-parallel-size=8 \
--enable-auto-tool-choice \
--tool-call-parser jamba
Test the API:
curl -X POST "http://localhost:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "ai21labs/AI21-Jamba-Mini-1.6",
"messages": [
{"role": "user", "content": "Who was the smartest person in history? Give reasons."}
]
}'
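Because vLLM exposes an OpenAI-compatible API, you can also query the server with the openai Python client; a minimal sketch (the api_key value is a placeholder, since the server does not require one by default):
# Query the OpenAI-compatible endpoint with the openai client (pip install openai)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="ai21labs/AI21-Jamba-Mini-1.6",
    messages=[
        {"role": "user", "content": "Who was the smartest person in history? Give reasons."}
    ],
)
print(response.choices[0].message.content)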
Jamba Large 1.6
Start the vLLM server:
vllm serve ai21labs/AI21-Jamba-Large-1.6 \
--quantization="experts_int8" \
--tensor-parallel-size=8 \
--enable-auto-tool-choice \
--tool-call-parser jamba
Test the API:
curl -X POST "http://localhost:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "ai21labs/AI21-Jamba-Large-1.6",
"messages": [
{"role": "user", "content": "Who was the smartest person in history? Give reasons."}
]
}'
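Because the server is launched with --enable-auto-tool-choice and --tool-call-parser jamba, it also accepts tool definitions in the OpenAI tools format. The sketch below is illustrative only; the get_weather tool is a hypothetical example, not part of the model or API:
# Illustrative tool-calling request via the openai client
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]
response = client.chat.completions.create(
    model="ai21labs/AI21-Jamba-Large-1.6",
    messages=[{"role": "user", "content": "What is the weather in Paris right now?"}],
    tools=tools,
)
# With auto tool choice enabled, the model may answer with a tool call instead of text
print(response.choices[0].message.tool_calls)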
Offline Inference (Batch Mode)
In offline mode, vLLM loads the model to perform batch inference tasks in a one-off, standalone manner.
from vllm import LLM
from vllm.sampling_params import SamplingParams
model_name = "ai21labs/AI21-Jamba-Mini-1.6"
sampling_params = SamplingParams(max_tokens=1024)
llm = LLM(
model=model_name,
quantization="experts_int8",
tensor_parallel_size=8,
)
messages = [
{
"role": "user",
"content": "Who was the smartest person in history? Give reasons.",
}
]
res = llm.chat(messages=messages, sampling_params=sampling_params)
print(res[0].outputs[0].text)
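The same LLM instance can also process several conversations in one call; a short sketch, reusing the llm and sampling_params objects from the example above and assuming your vLLM version accepts a list of conversations:
# Batch several independent conversations in a single call
conversations = [
    [{"role": "user", "content": "Summarize the benefits of hybrid SSM-Transformer architectures."}],
    [{"role": "user", "content": "Explain tensor parallelism in one paragraph."}],
]
results = llm.chat(messages=conversations, sampling_params=sampling_params)
for r in results:
    print(r.outputs[0].text)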
Option 2: Quick Start with Docker
For containerized deployment, use vLLM’s official Docker image to run an inference server (refer to the vLLM Docker documentation for comprehensive details).
Pull the Docker image
docker pull vllm/vllm-openai:v0.8.5.post1
Run the container
Launch vLLM in server mode with your chosen model.
Jamba Mini 1.6
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<your-hf-token>" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:v0.8.5.post1 \
--model ai21labs/AI21-Jamba-Mini-1.6 \
--quantization="experts_int8" \
--tensor-parallel-size=8 \
--enable-auto-tool-choice \
--tool-call-parser jamba
Jamba Large 1.6
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<your-hf-token>" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:v0.8.5.post1 \
--model ai21labs/AI21-Jamba-Large-1.6 \
--quantization="experts_int8" \
--tensor-parallel-size=8 \
--enable-auto-tool-choice \
--tool-call-parser jamba
Once the container is up and in a healthy state, you can test inference using the same code samples as in the “Online Inference (Server Mode)” section.
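If you want to script that readiness check, you can poll the server’s /health endpoint from Python before sending requests; a minimal sketch assuming the default port mapping:
# Wait until the vLLM server reports healthy
import time
import requests

while True:
    try:
        if requests.get("http://localhost:8000/health", timeout=5).status_code == 200:
            print("vLLM server is ready")
            break
    except requests.exceptions.RequestException:
        pass
    time.sleep(10)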
If you prefer to use your own storage for model weights, download them from your self-hosted storage (e.g., AWS S3, Google Cloud Storage) and mount the local path into the container with -v /path/to/model:/mnt/model/, passing --model="/mnt/model/" instead of the HuggingFace model identifier.
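If you do not already have a copy in your own storage, one way to obtain the weights locally is to pull them once from the HuggingFace Hub with huggingface_hub and point the -v mount at that directory; a sketch where /path/to/model is a placeholder:
# Pre-download the model weights into a local directory for mounting into the container
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="ai21labs/AI21-Jamba-Mini-1.6",
    local_dir="/path/to/model",
)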
Next Steps
Resources