Troubleshooting & Performance Optimization
Resolve common issues and optimize performance for AI21’s Jamba model deployments
Overview
This guide helps you troubleshoot common deployment issues and optimize performance for AI21’s Jamba models across different deployment scenarios.
Before troubleshooting, ensure you're using a recommended vLLM version (v0.6.5 to v0.8.5.post1).
Troubleshooting
Memory Issues
Out of Memory (OOM) Errors
Symptoms:
- CUDA out of memory errors
- Process killed by system
Solutions:
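A few vLLM launch flags commonly relieve OOM pressure. The sketch below assumes a `vllm serve` deployment; the model ID and flag values are illustrative examples, not tuned recommendations:

```shell
# Illustrative flags for reducing GPU memory pressure (values are examples, not tuned):
# --gpu-memory-utilization: lower the fraction of GPU memory vLLM claims (default 0.9)
# --max-model-len: cap the context length to shrink the KV-cache allocation
# --max-num-seqs: limit the number of concurrent sequences per batch
# --tensor-parallel-size: shard the model across multiple GPUs
vllm serve ai21labs/AI21-Jamba-1.5-Mini \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --max-num-seqs 64 \
  --tensor-parallel-size 2
```

Lowering `--gpu-memory-utilization` trades KV-cache capacity (and thus batch throughput) for headroom against OOM; reducing `--max-model-len` has the largest effect when your workloads never approach the model's maximum context.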
Memory Fragmentation
Symptoms:
- Inconsistent OOM errors
- Memory usage appears lower than expected
Solutions:
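One common mitigation is switching PyTorch's CUDA allocator to expandable segments, which reduces fragmentation in recent PyTorch releases. A minimal sketch (model ID and utilization value are illustrative):

```shell
# Enable PyTorch's expandable-segments CUDA allocator mode to reduce
# memory fragmentation, then relaunch the server:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
vllm serve ai21labs/AI21-Jamba-1.5-Mini --gpu-memory-utilization 0.85
```

If fragmentation-related OOMs persist, restarting the serving process between long runs also releases fragmented allocator pools.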
Model Loading Issues
Slow Model Loading
Storage Recommendations:
- Network Storage: >1 GB/s bandwidth
- RAM Disk: Load model from RAM if possible
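The RAM-disk approach can be sketched as follows, staging the weights on a tmpfs mount before serving (paths, the tmpfs size, and the model ID are example values; size the mount to the model's weight files):

```shell
# Create a RAM-backed filesystem and stage the model weights there:
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=120G tmpfs /mnt/ramdisk
# Point vLLM's weight download/load directory at the RAM disk:
vllm serve ai21labs/AI21-Jamba-1.5-Mini --download-dir /mnt/ramdisk
```

Note that tmpfs contents are lost on reboot, so the weights must be re-staged (or re-downloaded) after a restart.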
Performance Issues
Chunked Prefill Optimization
When to Use:
- Long input sequences (>8K tokens)
- High memory pressure during prefill
- Mixed sequence lengths in batches
Configuration:
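A minimal configuration sketch for enabling chunked prefill in vLLM (the token budget of 2048 is illustrative, not a tuned value):

```shell
# Enable chunked prefill and bound the number of tokens processed per
# scheduler step, so long prompts are prefetched in chunks instead of
# one large prefill batch:
vllm serve ai21labs/AI21-Jamba-1.5-Mini \
  --enable-chunked-prefill \
  --max-num-batched-tokens 2048
```

Smaller `--max-num-batched-tokens` values lower peak prefill memory and improve inter-token latency for concurrent decodes, at some cost to prefill throughput.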
Getting Help
Need Support?
If you need support, please contact our team at support@ai21.com with the following information:
Environment Details:
- Hardware specifications (GPU model, memory, CPU)
- Software versions (vLLM, CUDA, drivers)
- Full vLLM command
Diagnostics:
- Full error messages and stack traces
- GPU utilization logs (nvidia-smi output)
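One way to capture GPU utilization logs for a support request (the query fields, interval, and output filename are examples):

```shell
# Log GPU utilization and memory usage every 5 seconds to a CSV file:
nvidia-smi \
  --query-gpu=timestamp,name,utilization.gpu,memory.used,memory.total \
  --format=csv -l 5 > gpu_usage.csv
```

Attaching a log that spans the period when the error occurred is far more useful than a single snapshot.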