Overview

This guide helps you troubleshoot common deployment issues and optimize performance for AI21’s Jamba models across different deployment scenarios.
Before troubleshooting, ensure you’re running a recommended vLLM version (v0.6.5 through v0.8.5.post1).

Troubleshooting

Memory Issues

Symptoms:
  • CUDA out of memory errors
  • Process killed by system
Solutions:
--quantization="experts_int8"      # Reduces memory usage (recommended)
--tensor-parallel-size=8           # Number of GPUs to use (1-8)
--max-model-len=128000             # Reduce context length if needed (max 256K)
--gpu-memory-utilization=0.8       # Limits GPU memory usage
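
For example, a memory-constrained launch on a multi-GPU node might combine these flags in a single command. This is only a sketch: the model ID, GPU count, and context length below are placeholders to adjust for your deployment.

vllm serve ai21labs/AI21-Jamba-1.5-Mini \
    --quantization="experts_int8" \
    --tensor-parallel-size=8 \
    --max-model-len=128000 \
    --gpu-memory-utilization=0.8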
Symptoms:
  • Inconsistent OOM errors
  • Memory usage appears lower than expected
Solutions:
--max-num-seqs=50    # Caps concurrent sequences per batch; lower it to reduce peak memory
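
A simple way to tune this is to lower the value, relaunch, and watch actual GPU memory while sending representative traffic. The polling interval below is just an example:

nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1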

Model Loading Issues

Storage Recommendations:
  • Network Storage: >1 GB/s bandwidth
  • RAM Disk: Load model from RAM if possible
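
If the host has enough system RAM, one common approach is to stage the weights on a tmpfs mount and point vLLM at the local copy. This is a sketch, not AI21-specific guidance: the mount point, size, and model path are assumptions to adapt to your environment.

sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=200G tmpfs /mnt/ramdisk
cp -r /path/to/jamba-weights /mnt/ramdisk/jamba
vllm serve /mnt/ramdisk/jamba --quantization="experts_int8"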

Performance Issues

When to Use Chunked Prefill:
  • Long input sequences (>8K tokens)
  • High memory pressure during prefill
  • Mixed sequence lengths in batches
Configuration:
--enable-chunked-prefill           # Enables chunked prefill
--max-num-batched-tokens=8192     # Adjust based on GPU memory
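
A sketch of a launch with chunked prefill enabled follows; the model ID is a placeholder, and the token budget should be tuned to your GPU memory:

vllm serve ai21labs/AI21-Jamba-1.5-Mini \
    --enable-chunked-prefill \
    --max-num-batched-tokens=8192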

Getting Help

Need Support?

If you need support, please contact our team at support@ai21.com with the following information:
Environment Details:
  • Hardware specifications (GPU model, memory, CPU)
  • Software versions (vLLM, CUDA, drivers)
  • Full vLLM command
Diagnostics:
  • Full error messages and stack traces
  • GPU utilization logs (nvidia-smi output)
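
One quick way to capture most of these diagnostics in a single file before writing in (a sketch; the file name is arbitrary):

python -c "import vllm; print(vllm.__version__)" > diagnostics.txt
nvidia-smi >> diagnostics.txt
nvidia-smi -q -d MEMORY,UTILIZATION >> diagnostics.txt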