Troubleshooting

Common Issues

| Symptom | Cause | Fix |
| --- | --- | --- |
| OOM during model load | Model too large for GPU | Use more aggressive quantization (Q4 or Q2), or use `device_map=auto` for CPU offload |
| OOM during inference | KV cache + model + SAE exceeds VRAM | Reduce `max_tokens`, use a smaller SAE width, or unload unused SAEs |
| OOM with hybrid/Mamba model | `mamba-ssm` not installed; the naive fallback creates a 22 GB+ tensor | Install the `mamba-ssm` package, or set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` |
| Steering has no effect | Hook not firing on the model architecture | Check logs for `[Steering Hook] FIRED` messages; some architectures return single tensors instead of tuples |
| Steering outputs identical at all strengths | Model state not reset between generations | Ensure `disable_cache=True` during steered generation |
| Model load fails with "TokenizersBackend" | Custom tokenizer class not available | The model needs a specific tokenizer package; try `trust_remote_code=True` |
| Model load fails with "BitNet" | Pre-quantized model conflicts with bitsandbytes | Pre-quantized models (BitNet, GPTQ) should not use bitsandbytes; miLLM auto-detects and skips it |
| 500 errors on /v1/chat/completions | Model crashed or leaked GPU memory | Restart the backend pod to free leaked VRAM, then reload the model |
| SAE attach fails with "dimension mismatch" | SAE `d_in` doesn't match the model's hidden dim at the target layer | Ensure the SAE was trained for this model and layer |
| WebSocket disconnects | Long-running inference blocks the event loop | Normal behavior; the WebSocket reconnects automatically |
| Labeling job gets 500s from miLLM | Model OOMing during inference | Switch to a smaller model or increase quantization |
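For the hybrid/Mamba OOM row, the allocator setting must be exported in the environment that launches the backend (the launch command itself is omitted here); a minimal sketch:

```shell
# Allow the CUDA caching allocator to grow segments instead of requiring one
# large contiguous block, which the naive Mamba fallback otherwise needs.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
echo "$PYTORCH_CUDA_ALLOC_CONF"
# → expandable_segments:True
```

Note this only changes allocator behavior; installing `mamba-ssm` is the fix that avoids the oversized tensor entirely.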

GPU Memory Troubleshooting

Check GPU memory usage:

```shell
nvidia-smi --query-gpu=memory.used,memory.free,memory.total --format=csv,noheader
```

Check what processes hold GPU memory:

```shell
nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv,noheader
```
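To total per-process usage from that CSV output, a small `awk` pipeline is enough; the sample data below stands in for real `nvidia-smi` output (pipe the real command in instead):

```shell
# Columns per the query above: pid, name, used_memory [MiB].
# Strip the " MiB" suffix and sum the third field.
sample="12345, python, 21480 MiB
67890, python, 1024 MiB"
echo "$sample" | awk -F', ' '{gsub(/ MiB/, "", $3); total += $3} END {print total " MiB in use"}'
# → 22504 MiB in use
```

Comparing this total against `memory.used` from the GPU-level query is a quick way to spot VRAM held by no listed process, i.e. a leak.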

If memory is leaked (no model shows as loaded but VRAM is used), restart the backend pod:

```shell
kubectl delete pod -n millm <pod-name>
```

Useful Environment Variables

| Variable | Purpose | Default |
| --- | --- | --- |
| `PYTORCH_CUDA_ALLOC_CONF` | Set to `expandable_segments:True` for hybrid models | Not set |
| `MODEL_CACHE_DIR` | Where models are stored | `/data/model_cache` |
| `SAE_CACHE_DIR` | Where SAEs are stored | `/data/sae_cache` |
| `LOG_LEVEL` | Backend log verbosity | `INFO` |
| `CORS_ORIGINS` | Allowed CORS origins | `http://localhost` |
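In a Kubernetes deployment these map to `env` entries on the backend container; a hedged fragment (values are illustrative, not required settings):

```yaml
# Backend container environment (illustrative values)
env:
  - name: PYTORCH_CUDA_ALLOC_CONF
    value: "expandable_segments:True"   # only needed for hybrid/Mamba models
  - name: MODEL_CACHE_DIR
    value: "/data/model_cache"          # matches the documented default
  - name: LOG_LEVEL
    value: "DEBUG"                      # raise verbosity while troubleshooting
```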