Overview

This guide walks you through self-deploying AI21’s Jamba models in your own infrastructure using vLLM. Choose the deployment method that best fits your needs.

We recommend vLLM versions v0.6.5 through v0.8.5.post1 for optimal performance and compatibility.

Prerequisites

For detailed information about hardware support and GPU requirements, see the vLLM GPU Installation Guide.

System Requirements

  • Model Size: 96.07GB
  • GPU Memory Required when Quantized: ~55GB
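
To confirm that your GPUs provide enough total memory before deploying, you can check with nvidia-smi (assuming NVIDIA hardware with drivers installed):

nvidia-smi --query-gpu=name,memory.total --format=csv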

Deployment Options

Option 1: vLLM Direct Usage

Create a Python virtual environment and install the vLLM package (version ≥0.6.5, ≤0.8.5.post1 to ensure maximum compatibility with all Jamba models).

# Create and activate virtual environment
python -m venv vllm-env
source vllm-env/bin/activate

# Install vLLM
pip install "vllm>=0.6.5,<=0.8.5.post1"
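
To confirm that the installed version falls within the recommended range, you can print it from Python:

python -c "import vllm; print(vllm.__version__)"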

Authenticate with the HuggingFace Hub using your access token $HF_TOKEN:

huggingface-cli login --token $HF_TOKEN
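
You can verify that your token was saved correctly before downloading the model:

huggingface-cli whoami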

Launch the vLLM server for API-based inference:

vllm serve ai21labs/AI21-Jamba-Mini-1.6 \
  --quantization="experts_int8" \
  --tensor-parallel-size=8 \
  --enable-auto-tool-choice \
  --tool-call-parser jamba

Test the API:

curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai21labs/AI21-Jamba-Mini-1.6",
    "messages": [
      {"role": "user", "content": "Who was the smartest person in history? Give reasons."}
    ]
  }'
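
Because the server was launched with --enable-auto-tool-choice and --tool-call-parser jamba, you can also send OpenAI-style tool definitions in the request. The get_weather tool below is a hypothetical example, not something provided by the model or server:

curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai21labs/AI21-Jamba-Mini-1.6",
    "messages": [
      {"role": "user", "content": "What is the weather in Paris right now?"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather for a given city",
          "parameters": {
            "type": "object",
            "properties": {
              "city": {"type": "string", "description": "Name of the city"}
            },
            "required": ["city"]
          }
        }
      }
    ]
  }'

If the model decides to use the tool, the response contains a tool_calls entry instead of plain text content.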

Option 2: Quick Start with Docker

For containerized deployment, use vLLM’s official Docker image to run an inference server (refer to the vLLM Docker documentation for comprehensive details).

1. Pull the Docker image:

docker pull vllm/vllm-openai:v0.8.5.post1

2. Run the container

Launch vLLM in server mode with your chosen model:

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=<your-hf-token>" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:v0.8.5.post1 \
  --model ai21labs/AI21-Jamba-Mini-1.6 \
  --quantization="experts_int8" \
  --tensor-parallel-size=8 \
  --enable-auto-tool-choice \
  --tool-call-parser jamba

Once the container is up and in a healthy state, you can test inference using the same API requests shown in Option 1 above.
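
As a quick readiness check (assuming the default port mapping above), you can poll the server's health and model list endpoints before sending requests:

# Returns HTTP 200 once the server is ready to accept requests
curl http://localhost:8000/health

# Lists the model(s) served by this container
curl http://localhost:8000/v1/models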

If you prefer to use your own storage for model weights, download them from your self-hosted storage (e.g., AWS S3, Google Cloud Storage) to a local path, then mount that path into the container with -v /path/to/model:/mnt/model/ and pass --model="/mnt/model/" instead of the HuggingFace model identifier.
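
For example, here is a minimal sketch (the bucket name and local path are placeholders) that downloads the weights from an S3 bucket and starts the container against the local copy:

# Download the model weights from your own bucket (requires the AWS CLI)
aws s3 sync s3://your-bucket/AI21-Jamba-Mini-1.6/ /path/to/model/

# Run vLLM against the mounted local path instead of the HuggingFace identifier
docker run --runtime nvidia --gpus all \
  -v /path/to/model:/mnt/model/ \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:v0.8.5.post1 \
  --model="/mnt/model/" \
  --quantization="experts_int8" \
  --tensor-parallel-size=8 \
  --enable-auto-tool-choice \
  --tool-call-parser jamba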
