Quantization
Quantization reduces model memory usage by representing weights with lower precision. Learn how to use quantization techniques with Jamba models for efficient inference and training.
Overview
Jamba models support several quantization techniques:
- FP8 Quantization: 8-bit floating point weights for reduced memory footprint and efficient deployment
- ExpertsInt8: Innovative quantization for MoE models in vLLM deployment
- 8-bit Quantization: Using BitsAndBytesConfig for training and inference
FP8 Quantization (vLLM)
FP8 variants of Jamba models ship with pre-quantized FP8 weights, significantly reducing storage requirements and memory footprint without compromising output quality.
FP8 quantization requires Hopper architecture GPUs such as NVIDIA H100 and NVIDIA H200.
Pre-quantized Model Weights
Prerequisites
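As an environment assumption (not stated in the original): you need an FP8-capable Hopper GPU (see the note above) and a recent vLLM build with FP8 support, installed for example via `pip install -U vllm`.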
Implementation
Load Pre-quantized FP8 Model
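A minimal loading sketch using vLLM's offline API. The model ID and `tensor_parallel_size` below are assumptions; substitute the pre-quantized FP8 Jamba checkpoint and GPU layout you are actually deploying.

```python
from vllm import LLM

# Assumption: replace with the pre-quantized FP8 Jamba checkpoint you are deploying.
MODEL_ID = "ai21labs/AI21-Jamba-Large-1.7"

# vLLM reads the quantization settings from the checkpoint config,
# so no explicit quantization argument is needed for pre-quantized weights.
llm = LLM(
    model=MODEL_ID,
    tensor_parallel_size=8,  # assumption: one node of 8 GPUs; adjust to your hardware
)
```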
Generate Text
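A short generation sketch using the `llm` object created above; the prompt and sampling values are illustrative only.

```python
from vllm import SamplingParams

prompts = ["Write a short note on FP8 quantization for large language models."]
sampling_params = SamplingParams(temperature=0.4, top_p=0.95, max_tokens=128)

# generate() returns one RequestOutput per prompt; print the first completion of each
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```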
Pre-quantized FP8 models require no additional quantization parameters since the weights are already quantized.
ExpertsInt8 Quantization (vLLM)
ExpertsInt8 is an innovative and efficient quantization technique developed specifically for Mixture of Experts (MoE) models deployed in vLLM, including Jamba models. This technique enables:
- Jamba Mini 1.7: Deploy on a single 80GB GPU
- Jamba Large 1.7: Deploy on a single node of 8x 80GB GPUs
Prerequisites
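As an environment assumption: ExpertsInt8 requires a reasonably recent vLLM release with support for this quantization mode, installable for example via `pip install -U vllm`.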
Implementation
Load Model with ExpertsInt8
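A minimal sketch, assuming the Jamba Mini 1.7 checkpoint ID below and vLLM's `quantization="experts_int8"` option; adjust `max_model_len` to your available GPU memory.

```python
from vllm import LLM

# Assumption: replace with the Jamba checkpoint you are deploying.
MODEL_ID = "ai21labs/AI21-Jamba-Mini-1.7"

llm = LLM(
    model=MODEL_ID,
    max_model_len=100 * 1024,     # ~100K-token context fits on a single 80GB GPU
    quantization="experts_int8",  # quantize the MoE expert weights to int8 at load time
)
```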
Generate Text
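Generation then works exactly as in the FP8 example above; a compact sketch:

```python
from vllm import SamplingParams

outputs = llm.generate(
    ["Summarize the benefits of int8 expert quantization in two sentences."],
    SamplingParams(temperature=0.4, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```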
With ExpertsInt8 quantization, you can fit prompts of up to 100K tokens on a single 80GB A100 GPU with Jamba Mini.
8-bit Quantization (Hugging Face)
With 8-bit quantization using BitsAndBytesConfig, it is possible to fit sequence lengths of up to 140K tokens on a single 80GB GPU.
Prerequisites
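Assuming a standard Hugging Face stack: install `transformers`, `torch`, `bitsandbytes`, and `accelerate` (for example via `pip install transformers torch bitsandbytes accelerate`); bitsandbytes provides the 8-bit kernels and accelerate handles device placement.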
Implementation
Configure 8-bit Quantization
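A sketch of the quantization config, excluding the Mamba blocks from quantization as recommended at the end of this section:

```python
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    # keep the Mamba blocks in higher precision to maintain model quality
    llm_int8_skip_modules=["mamba"],
)
```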
Load Model with Quantization
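A minimal loading sketch using the config defined above. The model ID is an assumption, and the flash-attention setting is optional; drop it if flash-attn is not installed.

```python
import torch
from transformers import AutoModelForCausalLM

# Assumption: replace with the Jamba checkpoint you are using.
MODEL_ID = "ai21labs/AI21-Jamba-Mini-1.7"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # assumption: remove if flash-attn is unavailable
    quantization_config=quantization_config,
    device_map="auto",  # let accelerate place layers across available devices
)
```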
Run Inference
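A short inference sketch with the quantized model; the prompt and generation length are illustrative.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

inputs = tokenizer(
    "A list of the best ways to reduce LLM memory usage:",
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```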
To maintain model quality, we recommend excluding the Mamba blocks from quantization by setting `llm_int8_skip_modules=["mamba"]`.