Overview
This guide helps you troubleshoot common deployment issues and optimize performance for AI21's Jamba models across different deployment scenarios. Before troubleshooting, ensure you're using a recommended vLLM version (v0.6.5 to v0.8.5.post1).
Troubleshooting
Memory Issues
Out of Memory (OOM) Errors
Symptoms:
- CUDA out of memory errors
- Process killed by system
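When OOM errors appear, the usual levers are lowering GPU memory utilization, capping the context length to shrink the KV cache, or sharding the model across GPUs. The following is a minimal sketch using vLLM's Python API; the model ID and exact values are illustrative, not prescriptive:

```python
from vllm import LLM

# Sketch: common settings for relieving CUDA memory pressure.
llm = LLM(
    model="ai21labs/AI21-Jamba-1.5-Mini",  # illustrative model ID; use your deployment's
    tensor_parallel_size=2,                # shard weights across 2 GPUs
    gpu_memory_utilization=0.85,           # leave headroom below the 0.90 default
    max_model_len=8192,                    # cap context length to shrink the KV cache
)
```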
Memory Fragmentation
Symptoms:
- Inconsistent OOM errors
- Memory usage appears lower than expected
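Fragmentation often responds to PyTorch's expandable-segments allocator mode, which lets the caching allocator grow memory blocks instead of fragmenting fixed-size ones. A minimal sketch, assuming the setting is applied before any CUDA allocation happens:

```python
import os

# Must be set before torch/vLLM initializes CUDA.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

from vllm import LLM  # import only after the allocator config is in place

llm = LLM(model="ai21labs/AI21-Jamba-1.5-Mini")  # illustrative model ID
```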
Model Loading Issues
Slow Model Loading
Storage Recommendations:
- Network Storage: use storage with >1 GB/s read bandwidth
- RAM Disk: load the model from a RAM-backed filesystem when possible (see the sketch after this list)
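A sketch of the RAM-disk approach, assuming a tmpfs mount at /dev/shm and hypothetical paths for the weights; copying once and loading from RAM avoids repeated reads over slow network storage:

```python
import shutil
from pathlib import Path

from vllm import LLM

# Hypothetical paths: a snapshot on network storage and a tmpfs destination.
src = Path("/mnt/models/jamba")  # slow network storage
dst = Path("/dev/shm/jamba")     # RAM-backed tmpfs (ensure it can hold the weights)

if not dst.exists():
    shutil.copytree(src, dst)    # one-time copy into RAM

llm = LLM(model=str(dst))        # load weights from the RAM disk
```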
Performance Issues
Chunked Prefill Optimization
When to Use:
- Long input sequences (>8K tokens)
- High memory pressure during prefill
- Mixed sequence lengths in batches
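Chunked prefill splits long prompts into pieces so decode requests can be interleaved with prefill work, smoothing memory use on long inputs. A minimal sketch via the Python API; the token budget here is illustrative and should be tuned for your GPU:

```python
from vllm import LLM

llm = LLM(
    model="ai21labs/AI21-Jamba-1.5-Mini",  # illustrative model ID
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,           # per-step token budget for scheduling
)
```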
Getting Help
Need Support?
If you need support, please contact our team at support@ai21.com with the following information:
Environment Details:
- Hardware specifications (GPU model, memory, CPU)
- Software versions (vLLM, CUDA, drivers)
- Full vLLM command
- Full error messages and stack traces
- GPU utilization logs (nvidia-smi output)
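One way to capture the GPU utilization logs mentioned above is to poll nvidia-smi and write its CSV output to a file; a sketch using a subprocess, with the query fields and interval as assumptions you can adjust:

```python
import subprocess

# Poll GPU stats every 5 seconds and append CSV rows to a log file.
# Attach the resulting file to your support request.
with open("gpu_utilization.csv", "w") as log:
    subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=timestamp,name,utilization.gpu,memory.used,memory.total",
            "--format=csv",
            "-l", "5",  # loop interval in seconds; stop with Ctrl+C
        ],
        stdout=log,
    )
```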