Overview
This guide walks you through self-deploying AI21’s Jamba models in your own infrastructure using vLLM. Choose the deployment method that best fits your needs. We recommend using vLLM version v0.6.5 to v0.8.5.post1 for optimal performance and compatibility.

Prerequisites
For detailed information about hardware support and GPU requirements, see the vLLM GPU Installation Guide.

System Requirements
- Model size: 97 GB
- GPU compute capability: 7.5 or higher
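To confirm your GPUs meet the compute capability requirement, you can query it directly; the command below is a sketch assuming a recent NVIDIA driver that supports the compute_cap query field.

```bash
# Query the name and compute capability of each visible GPU (requires a recent NVIDIA driver)
nvidia-smi --query-gpu=name,compute_cap --format=csv
```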
Deployment Options
Option 1: vLLM Direct Usage
Create a Python virtual environment and install the vLLM package (version ≥ 0.6.5, ≤ 0.8.5.post1 to ensure maximum compatibility with all Jamba models). Then export your Hugging Face access token as $HF_TOKEN:
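A minimal setup sketch, assuming a Linux shell with Python 3 available; the environment name and token value are placeholders:

```bash
# Create and activate a virtual environment
python -m venv jamba-vllm
source jamba-vllm/bin/activate

# Install vLLM within the recommended version range
pip install "vllm>=0.6.5,<=0.8.5.post1"

# Export your Hugging Face access token so vLLM can download the model weights
export HF_TOKEN=<your_huggingface_token>
```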
Launch the vLLM server for API-based inference: start the vLLM server, then test the API, as shown in the sketch below.
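A minimal sketch of serving and querying a Jamba model, assuming the ai21labs/AI21-Jamba-Mini-1.6 Hugging Face identifier and the default port 8000; substitute the Jamba variant and hardware settings you actually deploy:

```bash
# Start the OpenAI-compatible vLLM server (downloads the model weights on first run)
vllm serve ai21labs/AI21-Jamba-Mini-1.6 \
  --host 0.0.0.0 \
  --port 8000

# In a second terminal, send a test chat completion request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ai21labs/AI21-Jamba-Mini-1.6",
        "messages": [{"role": "user", "content": "Hello, Jamba!"}],
        "max_tokens": 64
      }'
```

Larger Jamba variants typically need to be sharded across multiple GPUs, which vLLM supports via the --tensor-parallel-size flag.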
Option 2: Quick Start with Docker
For containerized deployment, use vLLM’s official Docker image to run an inference server (refer to the vLLM Docker documentation for comprehensive details).

1. Pull the Docker image
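For example, assuming the vllm/vllm-openai image from Docker Hub and a tag within the recommended version range:

```bash
# Pull the official vLLM OpenAI-compatible server image
docker pull vllm/vllm-openai:v0.8.5.post1
```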
2. Run the container
Launch vLLM in server mode with your chosen model, as in the sketch below. Once the container is up and in a healthy state, you can test inference using the same code samples as in the Online Inference (Server Mode) section. Make sure to use the correct model identifier based on your chosen quantization approach.
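A minimal run sketch, assuming the NVIDIA Container Toolkit is installed and again using ai21labs/AI21-Jamba-Mini-1.6 as a placeholder model identifier:

```bash
# Run the vLLM OpenAI-compatible server in a container
docker run --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -e HF_TOKEN=$HF_TOKEN \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:v0.8.5.post1 \
  --model ai21labs/AI21-Jamba-Mini-1.6
```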
If you prefer to use your own storage for model weights, you can download them from your self-hosted storage (e.g., AWS S3, Google Cloud Storage) and mount the local path to the container using -v /path/to/model:/mnt/model/ and --model="/mnt/model/" instead of the HuggingFace model identifier, as in the sketch below.
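For instance, assuming the weights have already been synced to a local directory (the path below is a placeholder):

```bash
# Serve weights from a mounted local directory instead of downloading from Hugging Face
docker run --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -v /path/to/model:/mnt/model/ \
  vllm/vllm-openai:v0.8.5.post1 \
  --model "/mnt/model/"
```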
Next Steps
- Cloud Platform Deployment: Deploy on AWS, Google Cloud, or Azure for production workloads
- Troubleshooting & Performance: Optimize performance and resolve common deployment issues
- API Reference: Learn about the complete API interface and parameters