Overview

This guide walks you through self-deploying AI21’s Jamba models in your own infrastructure using vLLM. Choose the deployment method that best fits your needs.

We recommend vLLM versions v0.6.5 through v0.8.5.post1 for optimal performance and compatibility.

Prerequisites

For detailed information about hardware support and GPU requirements, see the vLLM GPU Installation Guide.

System Requirements

  • Model Size: 96.07GB
  • GPU Memory Required when Quantized: ~55GB
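
To confirm that your GPUs provide enough total memory before deploying, you can check with nvidia-smi (assuming NVIDIA hardware with drivers installed):

nvidia-smi --query-gpu=name,memory.total --format=csv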

Deployment Options

Option 1: vLLM Direct Usage

Create a Python virtual environment and install the vLLM package (version ≥0.6.5, ≤0.8.5.post1 to ensure maximum compatibility with all Jamba models).

# Create and activate virtual environment
python -m venv vllm-env
source vllm-env/bin/activate

# Install vLLM
pip install "vllm>=0.6.5,<=0.8.5.post1"
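
To confirm that the installed version falls within the recommended range, you can print it from Python:

python -c "import vllm; print(vllm.__version__)"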

Authenticate with the HuggingFace Hub using your access token $HF_TOKEN:

huggingface-cli login --token $HF_TOKEN
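
You can verify that your token was saved correctly before downloading the model:

huggingface-cli whoami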

Launch the vLLM server for API-based inference:

vllm serve ai21labs/AI21-Jamba-Mini-1.6 \
  --quantization="experts_int8" \
  --tensor-parallel-size=8 \
  --enable-auto-tool-choice \
  --tool-call-parser jamba

Test the API:

curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai21labs/AI21-Jamba-Mini-1.6",
    "messages": [
      {"role": "user", "content": "Who was the smartest person in history? Give reasons."}
    ]
  }'
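
Because the server was launched with --enable-auto-tool-choice and --tool-call-parser jamba, you can also send OpenAI-style tool definitions in the request. The get_weather tool below is a hypothetical example, not something provided by the model or server:

curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai21labs/AI21-Jamba-Mini-1.6",
    "messages": [
      {"role": "user", "content": "What is the weather in Paris right now?"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather for a given city",
          "parameters": {
            "type": "object",
            "properties": {
              "city": {"type": "string", "description": "Name of the city"}
            },
            "required": ["city"]
          }
        }
      }
    ]
  }'

If the model decides to use the tool, the response contains a tool_calls entry instead of plain text content.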

Option 2: Quick Start with Docker

For containerized deployment, use vLLM’s official Docker image to run an inference server (refer to the vLLM Docker documentation for comprehensive details).

1. Pull the Docker image:

docker pull vllm/vllm-openai:v0.8.5.post1

2. Run the container

Launch vLLM in server mode with your chosen model:

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=<your-hf-token>" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:v0.8.5.post1 \
  --model ai21labs/AI21-Jamba-Mini-1.6 \
  --quantization="experts_int8" \
  --tensor-parallel-size=8 \
  --enable-auto-tool-choice \
  --tool-call-parser jamba

Once the container is up and in a healthy state, you can test inference using the same API requests shown in Option 1 above.
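
As a quick readiness check (assuming the default port mapping above), you can poll the server's health and model list endpoints before sending requests:

# Returns HTTP 200 once the server is ready to accept requests
curl http://localhost:8000/health

# Lists the model(s) served by this container
curl http://localhost:8000/v1/models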

If you prefer to use your own storage for model weights, download them from your self-hosted storage (e.g., AWS S3, Google Cloud Storage) to a local path, then mount that path into the container with -v /path/to/model:/mnt/model/ and pass --model="/mnt/model/" instead of the HuggingFace model identifier.
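
For example, here is a minimal sketch (the bucket name and local path are placeholders) that downloads the weights from an S3 bucket and starts the container against the local copy:

# Download the model weights from your own bucket (requires the AWS CLI)
aws s3 sync s3://your-bucket/AI21-Jamba-Mini-1.6/ /path/to/model/

# Run vLLM against the mounted local path instead of the HuggingFace identifier
docker run --runtime nvidia --gpus all \
  -v /path/to/model:/mnt/model/ \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:v0.8.5.post1 \
  --model="/mnt/model/" \
  --quantization="experts_int8" \
  --tensor-parallel-size=8 \
  --enable-auto-tool-choice \
  --tool-call-parser jamba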
