This guide walks you through self-deploying AI21’s Jamba models in your own infrastructure using vLLM. Choose the deployment method that best fits your needs.
We recommend using vLLM version v0.6.5 to v0.8.5.post1 for optimal performance and compatibility.
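If you are running vLLM directly on the host rather than in Docker, a minimal setup is sketched below; the quantization and tensor-parallel settings mirror the offline example later in this guide and are assumptions to adjust for your hardware. Once the server is listening on port 8000, you can exercise it with the curl examples that follow.

```bash
# Install a pinned vLLM release from the recommended range (v0.6.5–v0.8.5.post1).
pip install "vllm==0.8.5.post1"

# Start an OpenAI-compatible server on the default port 8000.
# experts_int8 quantization and a tensor-parallel size of 8 mirror the offline
# example below; tune both for your GPU count and memory.
vllm serve ai21labs/AI21-Jamba-Mini-1.6 \
    --quantization experts_int8 \
    --tensor-parallel-size 8
```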
Jamba Mini 1.6:

```bash
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai21labs/AI21-Jamba-Mini-1.6",
    "messages": [
      {"role": "user", "content": "Who was the smartest person in history? Give reasons."}
    ]
  }'
```
Jamba Large 1.6:

```bash
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai21labs/AI21-Jamba-Large-1.6",
    "messages": [
      {"role": "user", "content": "Who was the smartest person in history? Give reasons."}
    ]
  }'
```
In offline mode, vLLM loads the model in-process and performs batch inference as a one-off, standalone job, without running a persistent server.
```python
from vllm import LLM
from vllm.sampling_params import SamplingParams

model_name = "ai21labs/AI21-Jamba-Mini-1.6"

sampling_params = SamplingParams(max_tokens=1024)

# experts_int8 quantization keeps memory usage manageable; tensor_parallel_size
# should match the number of GPUs available.
llm = LLM(
    model=model_name,
    quantization="experts_int8",
    tensor_parallel_size=8,
)

messages = [
    {
        "role": "user",
        "content": "Who was the smartest person in history? Give reasons.",
    }
]

res = llm.chat(messages=messages, sampling_params=sampling_params)
print(res[0].outputs[0].text)
```
For containerized deployment, use vLLM’s official Docker image to run an inference server (refer to the vLLM Docker documentation for comprehensive details).
1. Pull the Docker image
```bash
docker pull vllm/vllm-openai:v0.8.5.post1
```
2. Run the container
Launch vLLM in server mode with your chosen model:
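A typical invocation, sketched from vLLM's standard Docker usage, is shown below; the Hugging Face cache mount, quantization, and tensor-parallel settings are assumptions to adapt to your environment.

```bash
# Arguments after the image name are passed to the vLLM OpenAI-compatible server,
# which listens on port 8000 inside the container.
docker run --gpus all \
    --ipc=host \
    -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    vllm/vllm-openai:v0.8.5.post1 \
    --model ai21labs/AI21-Jamba-Mini-1.6 \
    --quantization experts_int8 \
    --tensor-parallel-size 8
```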
Once the container is up and in a healthy state, you can test your inference using the same request examples shown above for online inference (server mode).
If you prefer to use your own storage for model weights, you can download them from your self-hosted storage (e.g., AWS S3, Google Cloud Storage) and mount the local path into the container using -v /path/to/model:/mnt/model/, then pass --model="/mnt/model/" instead of the Hugging Face model identifier.
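For example, a sketch using locally downloaded weights (the host path is a placeholder; adjust the remaining flags as in the previous example):

```bash
# /path/to/model is a placeholder for the directory containing the downloaded weights.
docker run --gpus all \
    --ipc=host \
    -p 8000:8000 \
    -v /path/to/model:/mnt/model/ \
    vllm/vllm-openai:v0.8.5.post1 \
    --model "/mnt/model/" \
    --quantization experts_int8 \
    --tensor-parallel-size 8
```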