Self-deployment guide

This guide provides detailed instructions on how to self-deploy the Jamba Mini and Jamba Large models.

Self-Deployment Options

Option 1: Direct Download from HuggingFace

  1. Download the base Docker image.
  2. Download the model weights and the tokenizer from HuggingFace:
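For example, a minimal sketch using the huggingface_hub CLI; the repo name and target directory are illustrative and should be adjusted to the model you're deploying:

# Install the Hugging Face Hub CLI (assumes Python and pip are available)
pip install -U "huggingface_hub[cli]"

# Download the Jamba Mini weights and tokenizer to a local directory
huggingface-cli download ai21labs/AI21-Jamba-Mini-1.6 --local-dir ./jamba-mini-1.6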

Option 2: Using a Specific Platform

  • For platforms like SageMaker, we provide tailored guides. You can find the SageMaker guide here.

Option 3: Running vLLM docker (On-Premises)

We recommend using vLLM version v0.6.5 to v0.7.3.
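If you install vLLM with pip rather than using the Docker image, you can pin it to this range; the command below is illustrative, not part of the official setup:

pip install "vllm>=0.6.5,<=0.7.3"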

NVIDIA Stack

  • NVIDIA Driver Version: 535.x.y
  • CUDA 12.1

While earlier versions may be compatible, they have not been tested and may result in less optimized performance or lack support for certain features.
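To check the driver and CUDA versions available on the host, you can run:

# Prints the installed driver version and the highest CUDA version it supports
nvidia-smi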

When deploying an LLM, you need the following components and steps:

  1. Hardware (Compute):

Jamba Mini

With the default BF16 precision, two 80GB A100 GPUs, and the default vLLM configuration, you'll be able to perform inference on prompts up to 200K tokens long. With more than two 80GB GPUs, you can easily fit the full 256K context.

Note: vLLM's main branch has some memory utilization improvements specific to the Jamba architecture that allow using the full 256K context length on two 80GB GPUs.
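If you run vLLM directly rather than through Docker, serving Jamba Mini on two GPUs might look like the sketch below; the repo name, parallelism, and context length are illustrative, not a tuned configuration:

# Serve Jamba Mini on 2 GPUs with a 200K-token context window
vllm serve ai21labs/AI21-Jamba-Mini-1.6 \
    --tensor-parallel-size 2 \
    --max-model-len 200000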

Jamba Large

Jamba Large 1.6 is too large to be loaded in full (FP32) or half (FP16/BF16) precision on a single node of 8 80GB GPUs. Therefore, quantization is required. We've developed an innovative and efficient quantization technique, ExpertsInt8, designed for MoE models deployed in vLLM, including Jamba models. Using it, you'll be able to deploy Jamba Large 1.6 on a single node of 8 80GB GPUs.
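A corresponding sketch for Jamba Large on a single node of 8 80GB GPUs, assuming the experts_int8 quantization option described above:

# Serve Jamba Large across 8 GPUs with ExpertsInt8 quantization
vllm serve ai21labs/AI21-Jamba-Large-1.6 \
    --tensor-parallel-size 8 \
    --quantization experts_int8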

  2. Runtime Environment (Execution Environment): This includes the OS, frameworks, libraries, and any dependencies required for the model to function. See this Docker image. We support the latest version of vLLM.

  3. Prepare the host environment for vLLM deployment: Pull the Docker image and prepare a working directory, as in the sketch below.
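A minimal sketch of this step; the local directory path is illustrative and matches the mount used later in this guide:

# Pull the vLLM OpenAI-compatible server image
docker pull vllm/vllm-openai:latest

# Create a local directory that will hold the model weights
mkdir -p /root/models/my-local-model-dir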

  4. Download the model directly from Hugging Face:
  • Jamba Large: https://huggingface.co/ai21labs/AI21-Jamba-Large-1.6
  • Jamba Mini: https://huggingface.co/ai21labs/AI21-Jamba-Mini-1.6
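As a hedged example, downloading Jamba Large into the local directory that is mounted into the container in the next step (using the huggingface_hub CLI assumed earlier):

# Download the weights and tokenizer into the directory that will be mounted as /mnt/model/
huggingface-cli download ai21labs/AI21-Jamba-Large-1.6 --local-dir /root/models/my-local-model-dir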

  5. Run the vLLM Docker container in ‘offline’ mode:

  • Set OFFLINE mode
  • Map the local model directory to the container path ‘/mnt/model/’
  • Set quantization to ‘experts_int8’ for better throughput and to fit on smaller instances
  • Use the container model path as the name of the model:
docker run --gpus all \
  -v /root/models/my-local-model-dir/:/mnt/model/ \
  -p 8000:8000 \
  --env "TRANSFORMERS_OFFLINE=1" \
  --env "HF_DATASETS_OFFLINE=1" \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model="/mnt/model/" \
  --quantization="experts_int8"
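Once the container is running, you can confirm that the server is up and serving the mounted model before sending chat requests; this check uses vLLM's OpenAI-compatible models endpoint:

# Should list a single model named "/mnt/model/"
curl http://localhost:8000/v1/models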
  6. Call the API with the model name set to ‘/mnt/model/’:
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "/mnt/model/",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
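The same endpoint accepts standard OpenAI-style sampling parameters; the values below are illustrative:

curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "/mnt/model/",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "max_tokens": 128,
    "temperature": 0.7
  }'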
