Troubleshooting

Common Issues

| Symptom | Cause | Fix |
| --- | --- | --- |
| OOM during model load | Model too large for GPU | Use more aggressive quantization (Q4 or Q2), or use `device_map=auto` for CPU offload |
| OOM during inference | KV cache + model + SAE exceeds VRAM | Reduce `max_tokens`, use a smaller SAE width, or unload unused SAEs |
| OOM with hybrid/Mamba model | `mamba-ssm` not installed; the naive fallback creates a 22 GB+ tensor | Install the `mamba-ssm` package, or set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` |
| Steering has no effect | Hook not firing on the model architecture | Check logs for `[Steering Hook] FIRED` messages; some architectures return single tensors instead of tuples |
| Steering outputs identical at all strengths | Model state not reset between generations | Ensure `disable_cache=True` during steered generation |
| Model load fails with "TokenizersBackend" | Custom tokenizer class not available | The model needs a specific tokenizer package; try `trust_remote_code=True` |
| Model load fails with "BitNet" | Pre-quantized model conflicts with bitsandbytes | Pre-quantized models (BitNet, GPTQ) should not use bitsandbytes; miLLM auto-detects and skips it |
| 500 errors on /v1/chat/completions | Model crashed or leaked GPU memory | Restart the backend pod to free leaked VRAM, then reload the model |
| SAE attach fails with "dimension mismatch" | SAE `d_in` doesn't match the model's hidden dim at the target layer | Ensure the SAE was trained for this model and layer |
| WebSocket disconnects | Long-running inference blocks the event loop | Normal behavior; the WebSocket reconnects automatically |
| Labeling job gets 500s from miLLM | Model OOMing during inference | Switch to a smaller model or increase quantization |
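For the hybrid/Mamba OOM row, the allocator setting must be exported in the environment that launches the backend (the launch command itself is omitted here); a minimal sketch:

```shell
# Allow the CUDA caching allocator to grow segments instead of requiring one
# large contiguous block, which the naive Mamba fallback otherwise needs.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
echo "$PYTORCH_CUDA_ALLOC_CONF"
# → expandable_segments:True
```

Note this only changes allocator behavior; installing `mamba-ssm` is the fix that avoids the oversized tensor entirely.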

GPU Memory Troubleshooting

Check GPU memory usage:

```shell
nvidia-smi --query-gpu=memory.used,memory.free,memory.total --format=csv,noheader
```

Check what processes hold GPU memory:

```shell
nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv,noheader
```
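To total per-process usage from that CSV output, a small `awk` pipeline is enough; the sample data below stands in for real `nvidia-smi` output (pipe the real command in instead):

```shell
# Columns per the query above: pid, name, used_memory [MiB].
# Strip the " MiB" suffix and sum the third field.
sample="12345, python, 21480 MiB
67890, python, 1024 MiB"
echo "$sample" | awk -F', ' '{gsub(/ MiB/, "", $3); total += $3} END {print total " MiB in use"}'
# → 22504 MiB in use
```

Comparing this total against `memory.used` from the GPU-level query is a quick way to spot VRAM held by no listed process, i.e. a leak.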

If memory is leaked (no model shows as loaded but VRAM is used), restart the backend pod:

```shell
kubectl delete pod -n millm <pod-name>
```

Useful Environment Variables

| Variable | Purpose | Default |
| --- | --- | --- |
| `PYTORCH_CUDA_ALLOC_CONF` | Set to `expandable_segments:True` for hybrid models | Not set |
| `MODEL_CACHE_DIR` | Where models are stored | `/data/model_cache` |
| `SAE_CACHE_DIR` | Where SAEs are stored | `/data/sae_cache` |
| `LOG_LEVEL` | Backend log verbosity | `INFO` |
| `CORS_ORIGINS` | Allowed CORS origins | `http://localhost` |
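In a Kubernetes deployment these map to `env` entries on the backend container; a hedged fragment (values are illustrative, not required settings):

```yaml
# Backend container environment (illustrative values)
env:
  - name: PYTORCH_CUDA_ALLOC_CONF
    value: "expandable_segments:True"   # only needed for hybrid/Mamba models
  - name: MODEL_CACHE_DIR
    value: "/data/model_cache"          # matches the documented default
  - name: LOG_LEVEL
    value: "DEBUG"                      # raise verbosity while troubleshooting
```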