Installation

Hardware Requirements

Tier         VRAM       Capability
Minimum      8 GB       1B–2B models at Q4 with small SAEs (16k width)
Recommended  16–24 GB   2B–9B models with wide SAEs (131k width)
Multi-GPU    2×24 GB+   Large models (27B+) with CPU offloading
GPU Memory Budget

The model, the SAE, and the KV cache all share GPU memory. A 9B model at Q4 (~6 GB) plus a 131k-width SAE (~5 GB) plus the KV cache (~2–4 GB) can fill a 24 GB GPU. Monitor VRAM usage in the Dashboard.
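
As a quick sanity check, the example budget above can be added up in a few lines of shell (the figures are this guide's rough estimates, not measured values):

```shell
# Back-of-envelope VRAM budget for the 24 GB example above
MODEL_GB=6        # 9B model at Q4
SAE_GB=5          # 131k-width SAE
KV_CACHE_GB=4     # upper end of the KV cache estimate
TOTAL=$((MODEL_GB + SAE_GB + KV_CACHE_GB))
echo "Estimated usage: ${TOTAL} GB of 24 GB"   # prints "Estimated usage: 15 GB of 24 GB"
```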

Docker Compose

# Clone the repository
git clone https://github.com/Onegaishimas/miLLM.git
cd miLLM

# Start all services
docker compose up -d

This starts:

  • PostgreSQL on port 5432
  • Redis on port 6379
  • Backend on port 8000
  • Admin UI on port 3000
  • Nginx on port 80

Access the admin UI at http://localhost or your configured domain.
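
The service-to-port mapping above can be captured in a small shell loop, for example as the skeleton of a scripted health check (the labels here are descriptive, not container names; swap the echo for curl or nc once the stack is up):

```shell
# Default host ports published by docker compose (from the list above)
for svc in postgres:5432 redis:6379 backend:8000 admin-ui:3000 nginx:80; do
  name="${svc%%:*}"    # label before the colon
  port="${svc##*:}"    # port after the colon
  echo "$name -> localhost:$port"
done
```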

Kubernetes

Kubernetes is the recommended deployment method for lab and production environments. The manifest at k8s/millm-deployment.yaml deploys the full miLLM stack into a dedicated millm namespace.

Architecture

┌──────────────────────────────────────────────────┐
│ Namespace: millm                                 │
│                                                  │
│  ┌──────────┐   ┌──────────┐                     │
│  │ postgres │   │  redis   │ (persistent storage)│
│  └──────────┘   └──────────┘                     │
│                                                  │
│  ┌──────────────────────────────────────┐        │
│  │ millm-backend Pod (GPU node)         │        │
│  │  └── backend (FastAPI + ML :8000)    │        │
│  └──────────────────────────────────────┘        │
│                                                  │
│  ┌────────────────────┐                          │
│  │ millm-frontend     │ (Admin UI/Nginx :80)     │
│  └────────────────────┘                          │
└──────────────────────────────────────────────────┘

NGINX Ingress
├── /api → millm-backend:8000
├── /v1 → millm-backend:8000 (OpenAI API)
├── /docs → millm-backend:8000 (Swagger)
├── /socket.io → millm-backend:8000 (WebSocket)
└── / → millm-frontend:80

Unlike miStudio, miLLM runs the backend as a single container — inference, SAE management, and WebSocket streaming all live in one process. An init container runs on startup to create and permission the data directories.

Prerequisites

Cluster requirements:

  • Kubernetes 1.25+ (MicroK8s, k3s, or full K8s)
  • NGINX Ingress Controller (ingressClassName: public)
  • NVIDIA Device Plugin for GPU scheduling
  • At least one node with an NVIDIA GPU and the NVIDIA Container Toolkit installed

Local tooling:

# Verify kubectl is connected to your cluster
kubectl cluster-info

# Verify NVIDIA device plugin is running
kubectl get pods -n kube-system | grep nvidia

# Verify GPU is schedulable
kubectl describe node <gpu-node> | grep nvidia.com/gpu

Step 1: Prepare Host Storage

miLLM uses hostPath volumes for persistent data. Create the required directories on the GPU node before deploying:

# Run on the GPU node (or via ssh)
sudo mkdir -p /data/millm/postgres
sudo mkdir -p /data/millm/redis
sudo mkdir -p /data/millm/data/model_cache
sudo mkdir -p /data/millm/data/sae_cache
sudo mkdir -p /data/millm/data/hf_cache
sudo chown -R 1000:1000 /data/millm

The init container (fix-permissions) also creates and permissions the cache directories automatically on each pod start, so this step is mainly to ensure the parent paths exist.
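
A small sketch for checking the layout before deploying (the check_millm_dirs helper and the MILLM_DATA_DIR override are illustrative, not part of miLLM):

```shell
# Check that the hostPath directories from Step 1 exist on this node
BASE="${MILLM_DATA_DIR:-/data/millm}"

check_millm_dirs() {
  local base="$1" missing=0
  for d in postgres redis data/model_cache data/sae_cache data/hf_cache; do
    if [ ! -d "$base/$d" ]; then
      echo "missing: $base/$d"
      missing=1
    fi
  done
  [ "$missing" -eq 0 ] && echo "all directories present under $base"
}

check_millm_dirs "$BASE" || true   # prints any paths still to be created
```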

Storage sizing

model_cache holds downloaded model weights — plan for 5–30 GB per model depending on quantization. sae_cache holds SAE weights — wide SAEs (131k features) can be 4–6 GB each. Provision 500 GB+ for active use.

Step 2: Configure the Manifest

Open k8s/millm-deployment.yaml and update the following before applying:

Node selector — pin the GPU pod to your GPU node:

nodeSelector:
  kubernetes.io/hostname: your-gpu-node-name   # replace mcs-lnxgpu01

Domain names — update ingress hosts and hostAlias:

# In hostAliases:
- ip: "192.168.x.x"   # Your GPU node IP
  hostnames:
    - "k8s-millm.yourdomain.com"

# In both Ingress objects (millm-ingress and millm-websocket-ingress):
- host: k8s-millm.yourdomain.com

Secrets — change default credentials before deploying to a shared environment:

# PostgreSQL (also update DATABASE_URL to match)
- name: POSTGRES_PASSWORD
  value: "change-me"

- name: DATABASE_URL
  value: "postgresql+asyncpg://millm:change-me@postgres:5432/millm"
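
One way to keep the two values in sync is to generate the password once and print both manifest lines, as in this sketch (a Kubernetes Secret would be the more idiomatic home for these values; the snippet below is only illustrative):

```shell
# Generate a random password and emit both manifest values that must match
PASS="$(head -c 16 /dev/urandom | base64 | tr -d '/+=' | head -c 20)"
echo "POSTGRES_PASSWORD: $PASS"
echo "DATABASE_URL: postgresql+asyncpg://millm:$PASS@postgres:5432/millm"
```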

CORS — add any additional origins that will access the API:

- name: CORS_ORIGINS
  value: "http://k8s-millm.yourdomain.com,http://your-open-webui.yourdomain.com"

Step 3: Deploy

# Apply the full manifest
kubectl apply -f k8s/millm-deployment.yaml

# Watch pods come up
kubectl get pods -n millm -w

Expected output once healthy:

NAME                             READY   STATUS    RESTARTS   AGE
millm-backend-xxxxxxxxx-xxxxx    1/1     Running   0          90s
millm-frontend-xxxxxxxxx-xxxxx   1/1     Running   0          60s
postgres-xxxxxxxxx-xxxxx         1/1     Running   0          60s
redis-xxxxxxxxx-xxxxx            1/1     Running   0          60s

First-start database migration

The backend runs alembic upgrade head automatically on startup via the entrypoint script. Check backend logs if the pod restarts — a failed migration is the most common first-boot issue.

Step 4: Configure DNS

# On each client machine
echo "192.168.x.x k8s-millm.yourdomain.com" | sudo tee -a /etc/hosts

Then access the Admin UI at http://k8s-millm.yourdomain.com and the OpenAI-compatible API at http://k8s-millm.yourdomain.com/v1.

Verifying the Deployment

# Pod status
kubectl get pods -n millm

# Backend logs
kubectl logs -n millm deployment/millm-backend --tail=50

# Verify GPU is allocated to the backend
kubectl exec -n millm deployment/millm-backend -- nvidia-smi

# Confirm API health
curl http://k8s-millm.yourdomain.com/api/health

# Confirm OpenAI-compatible endpoint
curl http://k8s-millm.yourdomain.com/v1/models

Connecting Open WebUI

miLLM exposes a fully OpenAI-compatible API at /v1. To connect Open WebUI or any other compatible client:

  1. In Open WebUI → Settings → Connections → OpenAI API
  2. Set the URL to http://k8s-millm.yourdomain.com/v1
  3. Leave the API key blank (or set any value — miLLM does not enforce key auth by default)
  4. Toggle the connection on and save

The model name shown in Open WebUI will match whichever model is currently loaded in miLLM.
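
A hypothetical smoke test of the same endpoint from the command line, following the standard OpenAI chat-completions request shape (MILLM_URL and the "current-model" name are placeholders; substitute your host and whatever model is actually loaded):

```shell
# Build a chat-completions request against the /v1 endpoint (sketch)
MILLM_URL="${MILLM_URL:-http://k8s-millm.yourdomain.com}"

request() {
  curl -s "$MILLM_URL/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{"model": "current-model", "messages": [{"role": "user", "content": "Hello"}]}'
}

echo "Will POST to $MILLM_URL/v1/chat/completions"
# request   # run once the cluster is reachable
```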

Updating to New Images

miLLM publishes new images to Docker Hub on every push to main. To update a running cluster:

# Restart the deployments; new images are pulled if imagePullPolicy is Always
kubectl rollout restart deployment/millm-backend -n millm
kubectl rollout restart deployment/millm-frontend -n millm

# Wait for completion
kubectl rollout status deployment/millm-backend -n millm --timeout=180s
kubectl rollout status deployment/millm-frontend -n millm --timeout=180s

Recreate strategy

The backend uses strategy: Recreate — the old pod terminates fully before the new one starts. This ensures the GPU and model weights are cleanly released before the new process takes over, preventing CUDA context conflicts.
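
The relevant Deployment field looks like this (a minimal fragment for orientation only; the actual manifest in k8s/millm-deployment.yaml is authoritative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: millm-backend
  namespace: millm
spec:
  strategy:
    type: Recreate   # old pod terminates fully before the new one starts
```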

WebSocket Configuration

miLLM uses Socket.IO for real-time inference progress and steering updates. The manifest includes a dedicated WebSocket ingress (millm-websocket-ingress) for the /socket.io path with extended timeouts:

Annotation          Value          Purpose
websocket-services  millm-backend  Enables WebSocket upgrade handling
proxy-http-version  1.1            Required for HTTP upgrade
proxy-read-timeout  86400          24-hour timeout for long inference sessions
proxy-send-timeout  86400          24-hour timeout for long inference sessions

Do not merge the WebSocket ingress

The WebSocket and HTTP ingresses are intentionally separate. Combining them with configuration-snippet Upgrade headers causes header duplication and broken connections.
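
For orientation, the timeout annotations from the table map onto the dedicated ingress roughly as follows (a sketch, not a copy of the manifest; annotation prefixes vary by ingress controller, so verify the exact keys in k8s/millm-deployment.yaml):

```yaml
metadata:
  name: millm-websocket-ingress
  annotations:
    # websocket-services uses a controller-specific prefix; check your controller's docs
    nginx.ingress.kubernetes.io/proxy-http-version: "1.1"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "86400"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "86400"
```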

Environment Variable Reference

Variable         Default                                                Description
DATABASE_URL     postgresql+asyncpg://millm:millm@postgres:5432/millm   Async PostgreSQL connection
REDIS_URL        redis://redis:6379/0                                   Redis connection
MODEL_CACHE_DIR  /data/model_cache                                      Where downloaded model weights are stored
SAE_CACHE_DIR    /data/sae_cache                                        Where downloaded SAE weights are stored
HF_HOME          /data/hf_cache                                         HuggingFace cache directory
CORS_ORIGINS     (comma-separated URLs)                                 Allowed origins for browser API access
LOG_LEVEL        INFO                                                   Logging verbosity: DEBUG, INFO, WARNING
LOG_FORMAT       json                                                   Log format: json or text
DEBUG            false                                                  Enable FastAPI debug mode
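
In the manifest these become env entries on the backend container; a minimal fragment using the defaults above might look like this (illustrative only, not a copy of the shipped manifest):

```yaml
env:
  - name: DATABASE_URL
    value: "postgresql+asyncpg://millm:millm@postgres:5432/millm"
  - name: REDIS_URL
    value: "redis://redis:6379/0"
  - name: MODEL_CACHE_DIR
    value: "/data/model_cache"
  - name: LOG_LEVEL
    value: "INFO"
```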