# vLLM

> Deploy AI21's Jamba models using vLLM in your own environment. vLLM is an open-source library for high-throughput LLM inference and serving.

## Overview

This guide walks you through self-deploying AI21's Jamba models in your own infrastructure using [vLLM](https://github.com/vllm-project/vllm). Choose the deployment method that best fits your needs.

<Note>
  We recommend vLLM versions `v0.6.5` through `v0.8.5.post1` for optimal performance and compatibility.
</Note>

## Prerequisites

For detailed information about hardware support and GPU requirements, see the [vLLM GPU Installation Guide](https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html).

### System Requirements

<Tabs>
  <Tab title="Jamba Mini">
    <Card>
      * **Model Size:** 97GB
      * **Compute Capability:** 7.5+
    </Card>
  </Tab>

  <Tab title="Jamba Large">
    <Card>
      * **Model Size:** 743GB
      * **Compute Capability:** 7.5+
    </Card>
  </Tab>

  <Tab title="Jamba Mini Experts Int8">
    <Card>
      * **Model Size:** 55GB
      * **Compute Capability:** 7.5+
    </Card>
  </Tab>

  <Tab title="Jamba Large Experts Int8">
    <Card>
      * **Model Size:** 560GB
      * **Compute Capability:** 7.5+
    </Card>
  </Tab>

  <Tab title="Jamba Mini FP8">
    <Card>
      * **Model Size:** 52GB
      * **Compute Capability:** 9.0+
    </Card>
  </Tab>

  <Tab title="Jamba Large FP8">
    <Card>
      * **Model Size:** 396GB
      * **Compute Capability:** 9.0+
    </Card>
  </Tab>
</Tabs>
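
As a rough pre-flight check, the requirements above can be encoded in a small lookup. This is only a sketch: the `REQUIREMENTS` table copies the sizes and compute capabilities listed in the tabs, while the function names are hypothetical. The GPU estimate counts model weights only (no KV cache, activations, or framework overhead), so treat it as a lower bound rather than a sizing recommendation.

```python
import math

# Model sizes (GB) and minimum CUDA compute capability, copied from the tabs above.
REQUIREMENTS = {
    "Jamba Mini": (97, (7, 5)),
    "Jamba Large": (743, (7, 5)),
    "Jamba Mini Experts Int8": (55, (7, 5)),
    "Jamba Large Experts Int8": (560, (7, 5)),
    "Jamba Mini FP8": (52, (9, 0)),
    "Jamba Large FP8": (396, (9, 0)),
}


def supports_variant(variant: str, capability: tuple[int, int]) -> bool:
    """Return True if a GPU's (major, minor) compute capability meets the minimum."""
    _, minimum = REQUIREMENTS[variant]
    return capability >= minimum


def min_gpus_for_weights(variant: str, gpu_memory_gb: float) -> int:
    """Lower bound on GPU count: weights only, ignoring KV cache and overhead."""
    size_gb, _ = REQUIREMENTS[variant]
    return math.ceil(size_gb / gpu_memory_gb)
```

On a machine with PyTorch installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple to pass as `capability`. For example, `supports_variant("Jamba Mini FP8", (8, 0))` is `False`, and `min_gpus_for_weights("Jamba Large Experts Int8", 80)` gives 7 for GPUs with 80 GB of memory.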

## Deployment Options

### Option 1: vLLM Direct Usage

Create a Python virtual environment and install the vLLM package (version `≥0.6.5, ≤0.8.5.post1` to ensure maximum compatibility with all Jamba models).

```bash theme={"system"}
# Create and activate virtual environment
python -m venv vllm-env
source vllm-env/bin/activate

# Install vLLM
# Quote the requirement so the shell does not treat ">" and "<" as redirections
pip install "vllm>=0.6.5,<=0.8.5.post1"
```
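
To confirm that the environment ended up with a version in the recommended range, a small check along these lines may help. It is a sketch using only the standard library; the helper names are hypothetical, and the simplified parser only understands `X.Y.Z` and `X.Y.Z.postN`, so pre-release or development builds are compared by their base version.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Range recommended in this guide.
MIN_VERSION = "0.6.5"
MAX_VERSION = "0.8.5.post1"

_VERSION_RE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)(?:\.post(\d+))?")


def parse_version(text: str) -> tuple[int, int, int, int]:
    """Parse 'X.Y.Z' or 'X.Y.Z.postN' into a comparable tuple (post defaults to 0)."""
    match = _VERSION_RE.match(text)
    if match is None:
        raise ValueError(f"Unrecognized version string: {text!r}")
    major, minor, patch, post = match.groups()
    return int(major), int(minor), int(patch), int(post or 0)


def in_supported_range(text: str) -> bool:
    """Return True if the version falls within the range recommended in this guide."""
    return parse_version(MIN_VERSION) <= parse_version(text) <= parse_version(MAX_VERSION)


def installed_vllm_supported() -> bool:
    """Check the vLLM installed in the current environment; False if it is not installed."""
    try:
        return in_supported_range(version("vllm"))
    except PackageNotFoundError:
        return False
```

Calling `installed_vllm_supported()` inside the activated virtual environment returns `True` only when vLLM is installed and its version lies between `0.6.5` and `0.8.5.post1` inclusive.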

Authenticate with the Hugging Face Hub using your access token, stored here in `$HF_TOKEN`:

```bash theme={"system"}
huggingface-cli login --token "$HF_TOKEN"
```

<Tabs>
  <Tab title="Online Inference (Server Mode)">
    Launch the vLLM server for API-based inference:

    <Tabs>
      <Tab title="Jamba Mini">
        **Start the vLLM server:**

        ```bash theme={"system"}
        vllm serve ai21labs/AI21-Jamba-Mini-1.7 \
          --quantization="experts_int8" \
          --enable-auto-tool-choice \
          --tool-call-parser jamba
        ```

        **Test the API:**

        <CodeGroup>
          ```bash cURL theme={"system"}
          curl -X POST "http://localhost:8000/v1/chat/completions" \
            -H "Content-Type: application/json" \
            -d '{
              "model": "ai21labs/AI21-Jamba-Mini-1.7",
              "messages": [
                {"role": "user", "content": "Who was the smartest person in history? Give reasons."}
              ]
            }'
          ```

          ```python Python theme={"system"}
          import httpx

          url = "http://localhost:8000/v1/chat/completions"
          headers = {
              "Content-Type": "application/json"
          }
          data = {
              "model": "ai21labs/AI21-Jamba-Mini-1.7",
              "messages": [
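                  {"role": "user", "content": "Who was the smartest person in history? Give reasons."}
              ],
          }

          # httpx defaults to a 5-second timeout, which long generations can exceed
          response = httpx.post(url, headers=headers, json=data, timeout=120.0)
          response.raise_for_status()
          print(response.json())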
          ```

          ```python AI21 Python SDK theme={"system"}
          from ai21 import AI21Client

          client = AI21Client(
              api_key="dummy-key",  # Not needed for local vLLM server
              api_host="http://localhost:8000/v1"
          )

          response = client.chat.completions.create(
              model="ai21labs/AI21-Jamba-Mini-1.7",
              messages=[
                  {"role": "user", "content": "Who was the smartest person in history? Give reasons."}
              ]
          )

          print(response.choices[0].message.content)
          ```
        </CodeGroup>
      </Tab>

      <Tab title="Jamba Large">
        **Start the vLLM server:**

        ```bash theme={"system"}
        vllm serve ai21labs/AI21-Jamba-Large-1.7 \
          --quantization="experts_int8" \
          --tensor-parallel-size=8 \
          --enable-auto-tool-choice \
          --tool-call-parser jamba
        ```

        **Test the API:**

        <CodeGroup>
          ```bash cURL theme={"system"}
          curl -X POST "http://localhost:8000/v1/chat/completions" \
            -H "Content-Type: application/json" \
            -d '{
              "model": "ai21labs/AI21-Jamba-Large-1.7",
              "messages": [
                {"role": "user", "content": "Who was the smartest person in history? Give reasons."}
              ]
            }'
          ```

          ```python Python theme={"system"}
          import httpx

          url = "http://localhost:8000/v1/chat/completions"
          headers = {
              "Content-Type": "application/json"
          }
          data = {
              "model": "ai21labs/AI21-Jamba-Large-1.7",
              "messages": [
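                  {"role": "user", "content": "Who was the smartest person in history? Give reasons."}
              ],
          }

          # httpx defaults to a 5-second timeout, which long generations can exceed
          response = httpx.post(url, headers=headers, json=data, timeout=120.0)
          response.raise_for_status()
          print(response.json())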
          ```

          ```python AI21 Python SDK theme={"system"}
          from ai21 import AI21Client

          client = AI21Client(
              api_key="dummy-key",  # Not needed for local vLLM server
              api_host="http://localhost:8000/v1"
          )

          response = client.chat.completions.create(
              model="ai21labs/AI21-Jamba-Large-1.7",
              messages=[
                  {"role": "user", "content": "Who was the smartest person in history? Give reasons."}
              ]
          )

          print(response.choices[0].message.content)
          ```
        </CodeGroup>
      </Tab>

      <Tab title="Jamba Mini FP8">
        **Start the vLLM server:**

        ```bash theme={"system"}
        vllm serve ai21labs/AI21-Jamba-Mini-1.7-FP8 \
          --enable-auto-tool-choice \
          --tool-call-parser jamba
        ```

        **Test the API:**

        <CodeGroup>
          ```bash cURL theme={"system"}
          curl -X POST "http://localhost:8000/v1/chat/completions" \
            -H "Content-Type: application/json" \
            -d '{
              "model": "ai21labs/AI21-Jamba-Mini-1.7-FP8",
              "messages": [
                {"role": "user", "content": "Who was the smartest person in history? Give reasons."}
              ]
            }'
          ```

          ```python Python theme={"system"}
          import httpx

          url = "http://localhost:8000/v1/chat/completions"
          headers = {
              "Content-Type": "application/json"
          }
          data = {
              "model": "ai21labs/AI21-Jamba-Mini-1.7-FP8",
              "messages": [
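                  {"role": "user", "content": "Who was the smartest person in history? Give reasons."}
              ],
          }

          # httpx defaults to a 5-second timeout, which long generations can exceed
          response = httpx.post(url, headers=headers, json=data, timeout=120.0)
          response.raise_for_status()
          print(response.json())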
          ```

          ```python AI21 Python SDK theme={"system"}
          from ai21 import AI21Client

          client = AI21Client(
              api_key="dummy-key",  # Not needed for local vLLM server
              api_host="http://localhost:8000/v1"
          )

          response = client.chat.completions.create(
              model="ai21labs/AI21-Jamba-Mini-1.7-FP8",
              messages=[
                  {"role": "user", "content": "Who was the smartest person in history? Give reasons."}
              ]
          )

          print(response.choices[0].message.content)
          ```
        </CodeGroup>
      </Tab>

      <Tab title="Jamba Large FP8">
        **Start the vLLM server:**

        ```bash theme={"system"}
        vllm serve ai21labs/AI21-Jamba-Large-1.7-FP8 \
          --tensor-parallel-size=8 \
          --enable-auto-tool-choice \
          --tool-call-parser jamba
        ```

        **Test the API:**

        <CodeGroup>
          ```bash cURL theme={"system"}
          curl -X POST "http://localhost:8000/v1/chat/completions" \
            -H "Content-Type: application/json" \
            -d '{
              "model": "ai21labs/AI21-Jamba-Large-1.7-FP8",
              "messages": [
                {"role": "user", "content": "Who was the smartest person in history? Give reasons."}
              ]
            }'
          ```

          ```python Python theme={"system"}
          import httpx

          url = "http://localhost:8000/v1/chat/completions"
          headers = {
              "Content-Type": "application/json"
          }
          data = {
              "model": "ai21labs/AI21-Jamba-Large-1.7-FP8",
              "messages": [
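                  {"role": "user", "content": "Who was the smartest person in history? Give reasons."}
              ],
          }

          # httpx defaults to a 5-second timeout, which long generations can exceed
          response = httpx.post(url, headers=headers, json=data, timeout=120.0)
          response.raise_for_status()
          print(response.json())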
          ```

          ```python AI21 Python SDK theme={"system"}
          from ai21 import AI21Client

          client = AI21Client(
              api_key="dummy-key",  # Not needed for local vLLM server
              api_host="http://localhost:8000/v1"
          )

          response = client.chat.completions.create(
              model="ai21labs/AI21-Jamba-Large-1.7-FP8",
              messages=[
                  {"role": "user", "content": "Who was the smartest person in history? Give reasons."}
              ]
          )

          print(response.choices[0].message.content)
          ```
        </CodeGroup>
      </Tab>
    </Tabs>
  </Tab>

  <Tab title="Offline Inference">
    In offline mode, vLLM loads the model in-process and runs batch inference directly from Python, without starting an API server.

    <CodeGroup>
      ```python Jamba Mini theme={"system"}
      from vllm import LLM
      from vllm.sampling_params import SamplingParams

      model_name = "ai21labs/AI21-Jamba-Mini-1.7"
      sampling_params = SamplingParams(max_tokens=1024)

      llm = LLM(
          model=model_name,
          quantization="experts_int8",
      )

      messages = [
          {
              "role": "user",
              "content": "Who was the smartest person in history? Give reasons.",
          }
      ]

      res = llm.chat(messages=messages, sampling_params=sampling_params)
      print(res[0].outputs[0].text)
      ```

      ```python Jamba Large theme={"system"}
      from vllm import LLM
      from vllm.sampling_params import SamplingParams

      model_name = "ai21labs/AI21-Jamba-Large-1.7"
      sampling_params = SamplingParams(max_tokens=1024)

      llm = LLM(
          model=model_name,
          quantization="experts_int8",
          tensor_parallel_size=8,
      )

      messages = [
          {
              "role": "user",
              "content": "Who was the smartest person in history? Give reasons."
          }
      ]

      res = llm.chat(messages=messages, sampling_params=sampling_params)
      print(res[0].outputs[0].text)
      ```

      ```python Jamba Mini FP8 theme={"system"}
      from vllm import LLM
      from vllm.sampling_params import SamplingParams

      model_name = "ai21labs/AI21-Jamba-Mini-1.7-FP8"
      sampling_params = SamplingParams(max_tokens=1024)

      llm = LLM(
          model=model_name,
      )

      messages = [
          {
              "role": "user",
              "content": "Who was the smartest person in history? Give reasons.",
          }
      ]

      res = llm.chat(messages=messages, sampling_params=sampling_params)
      print(res[0].outputs[0].text)
      ```

      ```python Jamba Large FP8 theme={"system"}
      from vllm import LLM
      from vllm.sampling_params import SamplingParams

      model_name = "ai21labs/AI21-Jamba-Large-1.7-FP8"
      sampling_params = SamplingParams(max_tokens=1024)

      llm = LLM(
          model=model_name,
          tensor_parallel_size=8,
      )

      messages = [
          {
              "role": "user",
              "content": "Who was the smartest person in history? Give reasons."
          }
      ]

      res = llm.chat(messages=messages, sampling_params=sampling_params)
      print(res[0].outputs[0].text)
      ```
    </CodeGroup>
  </Tab>
</Tabs>
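
The server commands above pass `--enable-auto-tool-choice` and `--tool-call-parser jamba`, which enable tool calling through the OpenAI-compatible `tools` field of a chat completions request. The following is a minimal sketch of such a request using only the standard library. The `get_weather` tool, its schema, and the helper names are hypothetical and exist only for illustration.

```python
import json
import urllib.request

# Hypothetical tool definition in the OpenAI-compatible function-calling schema.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_request(model: str, question: str) -> dict:
    """Build a chat completions payload that lets the model decide whether to call a tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "tools": [WEATHER_TOOL],
        "tool_choice": "auto",
    }


def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (function name, parsed arguments) pairs from a chat completions response."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in calls
    ]


def post_chat(payload: dict, base_url: str = "http://localhost:8000/v1") -> dict:
    """POST the payload to a running vLLM server and return the decoded JSON body."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as reply:
        return json.load(reply)
```

With a server running, `extract_tool_calls(post_chat(build_tool_request("ai21labs/AI21-Jamba-Mini-1.7", "What is the weather in Paris?")))` returns any tool calls the model chose to make. An empty list means it answered in plain text instead.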

### Option 2: Quick Start with Docker

For containerized deployment, use vLLM's official Docker image to run an inference server (refer to the [vLLM Docker documentation](https://docs.vllm.ai/en/latest/deployment/docker.html) for comprehensive details).

<Steps>
  <Step title="Pull the Docker image">
    ```bash theme={"system"}
    docker pull vllm/vllm-openai:v0.8.5.post1
    ```
  </Step>

  <Step title="Run the container">
    Launch vLLM in server mode with your chosen model:

    <Tabs>
      <Tab title="Jamba Mini">
        ```bash theme={"system"}
        docker run --runtime nvidia --gpus all \
          -v ~/.cache/huggingface:/root/.cache/huggingface \
          --env "HUGGING_FACE_HUB_TOKEN=<your-hf-token>" \
          -p 8000:8000 \
          --ipc=host \
          vllm/vllm-openai:v0.8.5.post1 \
          --model ai21labs/AI21-Jamba-Mini-1.7 \
          --quantization="experts_int8" \
          --enable-auto-tool-choice \
          --tool-call-parser jamba
        ```
      </Tab>

      <Tab title="Jamba Large">
        ```bash theme={"system"}
        docker run --runtime nvidia --gpus all \
          -v ~/.cache/huggingface:/root/.cache/huggingface \
          --env "HUGGING_FACE_HUB_TOKEN=<your-hf-token>" \
          -p 8000:8000 \
          --ipc=host \
          vllm/vllm-openai:v0.8.5.post1 \
          --model ai21labs/AI21-Jamba-Large-1.7 \
          --quantization="experts_int8" \
          --tensor-parallel-size=8 \
          --enable-auto-tool-choice \
          --tool-call-parser jamba
        ```
      </Tab>

      <Tab title="Jamba Mini FP8">
        ```bash theme={"system"}
        docker run --runtime nvidia --gpus all \
          -v ~/.cache/huggingface:/root/.cache/huggingface \
          --env "HUGGING_FACE_HUB_TOKEN=<your-hf-token>" \
          -p 8000:8000 \
          --ipc=host \
          vllm/vllm-openai:v0.8.5.post1 \
          --model ai21labs/AI21-Jamba-Mini-1.7-FP8 \
          --enable-auto-tool-choice \
          --tool-call-parser jamba
        ```
      </Tab>

      <Tab title="Jamba Large FP8">
        ```bash theme={"system"}
        docker run --runtime nvidia --gpus all \
          -v ~/.cache/huggingface:/root/.cache/huggingface \
          --env "HUGGING_FACE_HUB_TOKEN=<your-hf-token>" \
          -p 8000:8000 \
          --ipc=host \
          vllm/vllm-openai:v0.8.5.post1 \
          --model ai21labs/AI21-Jamba-Large-1.7-FP8 \
          --tensor-parallel-size=8 \
          --enable-auto-tool-choice \
          --tool-call-parser jamba
        ```
      </Tab>
    </Tabs>

    Once the container is up and healthy, test inference with the same code samples as in the [Online Inference (Server Mode)](#online-inference-server-mode) section. Use the model identifier that matches the variant you launched (for example, the `-FP8` identifier for the FP8 models).
  </Step>
</Steps>
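
Loading the weights can take a while, so it may help to wait for the server to report healthy before sending requests. The sketch below polls the `/health` endpoint of vLLM's OpenAI-compatible server with the standard library. The function names, default timeout, and polling interval are assumptions chosen for illustration, not recommended values.

```python
import time
import urllib.request
from typing import Callable


def probe_health(base_url: str) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as reply:
            return reply.status == 200
    except OSError:
        # Covers refused connections, resets, timeouts, and HTTP error responses.
        return False


def wait_until_healthy(
    base_url: str = "http://localhost:8000",
    timeout_s: float = 1800.0,
    interval_s: float = 10.0,
    probe: Callable[[str], bool] = probe_health,
) -> bool:
    """Poll the server until it reports healthy or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(base_url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_until_healthy()` after `docker run` blocks until the server answers or the timeout passes. The `probe` parameter is injectable so the loop can be exercised without a running container.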

<Note>
  To serve model weights from your own storage, download them from that storage (e.g., AWS S3, Google Cloud Storage) to a local directory, then mount that directory into the container with `-v /path/to/model:/mnt/model/` and pass `--model="/mnt/model/"` instead of the Hugging Face model identifier.
</Note>
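
For scripted deployments, the mount-and-serve pattern from the note can be assembled programmatically. The helper below is a hypothetical sketch that only builds the argument list. It reuses the image tag and flags shown earlier on this page and omits the Hugging Face token and cache mount, on the assumption that nothing needs to be fetched from the Hub.

```python
import shlex


def docker_command_for_local_model(
    model_dir: str,
    image: str = "vllm/vllm-openai:v0.8.5.post1",
    extra_args: tuple[str, ...] = (),
) -> list[str]:
    """Build a `docker run` argument list that serves weights from a local directory."""
    mount_point = "/mnt/model/"
    return [
        "docker", "run", "--runtime", "nvidia", "--gpus", "all",
        "-v", f"{model_dir}:{mount_point}",  # expose the local weights to the container
        "-p", "8000:8000",
        "--ipc=host",
        image,
        "--model", mount_point,  # replaces the Hugging Face model identifier
        *extra_args,  # e.g. quantization or tensor-parallel flags
    ]


def as_shell(command: list[str]) -> str:
    """Render the argument list as a single shell-safe command line."""
    return shlex.join(command)
```

For example, `as_shell(docker_command_for_local_model("/path/to/model", extra_args=("--tensor-parallel-size=8",)))` yields a command line that can be reviewed before running, and the list form can be passed directly to `subprocess.run`.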

## Next Steps

<CardGroup cols={3}>
  <Card title="Cloud Platform Deployment" icon="cloud" href="/docs/cloud-platform-deployment">
    Deploy on AWS, Google Cloud, or Azure for production workloads
  </Card>

  <Card title="Troubleshooting & Performance" icon="gauge-high" href="/docs/troubleshooting-performance">
    Optimize performance and resolve common deployment issues
  </Card>

  <Card title="API Reference" icon="code" href="/reference/jamba-1-6-api-ref">
    Learn about the complete API interface and parameters
  </Card>
</CardGroup>

## Resources

* [AI21 Jamba Mini on Hugging Face](https://huggingface.co/ai21labs/AI21-Jamba-Mini-1.7)
* [AI21 Jamba Large on Hugging Face](https://huggingface.co/ai21labs/AI21-Jamba-Large-1.7)
* [vLLM Documentation](https://docs.vllm.ai/en/latest/index.html)
* [ExpertsInt8 Quantization Details](https://github.com/vllm-project/vllm/pull/7415)
