+++
disableToc = false
title = "Distributed Mode"
weight = 14
url = "/features/distributed-mode/"
+++
Distributed mode enables horizontal scaling of LocalAI across multiple machines using PostgreSQL for state and node registry, and NATS for real-time coordination. Unlike the [P2P/federation approach]({{% relref "features/distributed_inferencing" %}}), distributed mode is designed for production deployments and Kubernetes environments where you need centralized management, health monitoring, and deterministic routing.
{{% notice note %}} Distributed mode requires authentication enabled with a PostgreSQL database — SQLite is not supported. This is because the node registry, job store, and other distributed state are stored in PostgreSQL tables. {{% /notice %}}
Architecture Overview
                  ┌─────────────────┐
                  │  Load Balancer  │
                  └────────┬────────┘
                           │
            ┌──────────────┼──────────────┐
            │              │              │
    ┌───────▼──────┐  ┌────▼─────┐  ┌─────▼──────┐
    │ Frontend #1  │  │ Frontend │  │ Frontend #N│
    │  (LocalAI)   │  │    #2    │  │ (LocalAI)  │
    └──────┬───────┘  └────┬─────┘  └─────┬──────┘
            │              │              │
    ┌───────▼──────────────▼──────────────▼───────┐
    │              PostgreSQL + NATS              │
    │     (node registry, jobs, coordination)     │
    └───────┬──────────────┬──────────────┬───────┘
            │              │              │
     ┌──────▼──────┐  ┌────▼─────┐  ┌─────▼──────┐
     │  Worker #1  │  │  Worker  │  │ Worker #N  │
     │  (generic)  │  │    #2    │  │ (generic)  │
     └─────────────┘  └──────────┘  └────────────┘
Frontends are stateless LocalAI instances that receive API requests and route them to worker nodes via the SmartRouter. All frontends share state through PostgreSQL and coordinate via NATS.
Workers are generic processes that self-register with a frontend. They don't have a fixed backend type — the SmartRouter dynamically installs the required backend via NATS backend.install events when a model request arrives.
Scheduling Algorithm
The SmartRouter uses idle-first scheduling with preemptive eviction:
- If the model is already loaded on a node → use it (per-model gRPC address)
- If no node has the model → prefer nodes with enough free VRAM
- Fall back to idle nodes (zero models), then least-loaded nodes
- If no node has capacity → evict the least-recently-used model with zero in-flight requests to free a node
- If all models are busy → wait (with timeout) for a model to become idle, then evict
- Send `backend.install` NATS event with backend name + model ID → worker starts a new gRPC process on a dynamic port
- SmartRouter calls gRPC `LoadModel` on the model-specific port, records in DB
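The selection order above can be sketched in a few lines of Python. This is an illustrative model only, not the SmartRouter's implementation: the `Node` and `LoadedModel` classes, the `max_models` capacity limit, and the tie-breaking rules (most free VRAM first, fewest in-flight requests for "least-loaded") are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class LoadedModel:
    name: str
    in_flight: int = 0       # requests currently being served
    last_used: float = 0.0   # timestamp of the most recent request


@dataclass
class Node:
    name: str
    free_vram: int           # free VRAM in GB (hypothetical unit)
    max_models: int = 2      # hypothetical per-node capacity limit
    models: list = field(default_factory=list)


def pick_node(nodes, model, needed_vram):
    """Return (node, model_to_evict). (None, None) means: wait and retry."""
    # 1. The model is already loaded somewhere: reuse that node.
    for node in nodes:
        if any(m.name == model for m in node.models):
            return node, None
    # 2. Prefer nodes with enough free VRAM (assumed tie-break: most free).
    fits = [n for n in nodes if n.free_vram >= needed_vram]
    if fits:
        return max(fits, key=lambda n: n.free_vram), None
    # 3. Fall back to idle nodes, then to the least-loaded node with a free slot.
    idle = [n for n in nodes if not n.models]
    if idle:
        return idle[0], None
    spare = [n for n in nodes if len(n.models) < n.max_models]
    if spare:
        return min(spare, key=lambda n: sum(m.in_flight for m in n.models)), None
    # 4. No capacity: evict the least-recently-used model with zero in-flight requests.
    idle_models = [(m.last_used, i, n, m)
                   for i, n in enumerate(nodes)
                   for m in n.models if m.in_flight == 0]
    if idle_models:
        _, _, node, victim = min(idle_models, key=lambda c: (c[0], c[1]))
        return node, victim
    # 5. Every model is busy: the caller waits (with a timeout) and tries again.
    return None, None
```

A caller would evict the returned model (if any), send the install event to the chosen node, and then call `LoadModel` on the address the worker replies with.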
Each model gets its own gRPC backend process, so a single worker can serve multiple models simultaneously (e.g., a chat model and an embedding model).
Prerequisites
- PostgreSQL (with pgvector extension recommended for RAG) — used for node registry, job store, auth, and shared state
- NATS server — used for real-time backend lifecycle events and file staging
- All services must be on the same network (or reachable via configured URLs)
Quick Start with Docker Compose
The easiest way to try distributed mode locally is with the provided Docker Compose file:
    docker compose -f docker-compose.distributed.yaml up
This starts PostgreSQL, NATS, a LocalAI frontend, and one worker node. When you send an inference request, the SmartRouter automatically installs the needed backend on the worker and loads the model. See the file for details on adding GPU support, shared volumes, and additional workers.
{{% notice tip %}}
Use docker-compose.distributed.yaml for quick local testing. For production, deploy PostgreSQL and NATS as managed services and run frontends/workers on separate hosts.
{{% /notice %}}
Frontend Configuration
The frontend is a standard LocalAI instance with distributed mode enabled. These flags are added to the local-ai run command:
| Flag | Env Var | Default | Description |
|---|---|---|---|
| `--distributed` | `LOCALAI_DISTRIBUTED` | `false` | Enable distributed mode |
| `--instance-id` | `LOCALAI_INSTANCE_ID` | auto UUID | Unique instance ID for this frontend |
| `--nats-url` | `LOCALAI_NATS_URL` | (required) | NATS server URL (e.g., nats://localhost:4222) |
| `--registration-token` | `LOCALAI_REGISTRATION_TOKEN` | (empty) | Token that workers must provide to register |
| `--auto-approve-nodes` | `LOCALAI_AUTO_APPROVE_NODES` | `false` | Auto-approve new worker nodes (skip admin approval) |
| `--auth` | `LOCALAI_AUTH` | `false` | Must be true for distributed mode |
| `--auth-database-url` | `LOCALAI_AUTH_DATABASE_URL` | (required) | PostgreSQL connection URL |
| `--backend-install-timeout` | `LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT` | `15m` | How long the frontend waits for a worker to acknowledge a backend install before considering the request stalled. Raise it when workers pull large backend images over slow links. If a worker takes longer than this, the operation shows as "still installing in background" in the admin UI and clears once the worker finishes. |
| `--backend-upgrade-timeout` | `LOCALAI_NATS_BACKEND_UPGRADE_TIMEOUT` | `15m` | Same as the install timeout, applied to backend upgrades (force-reinstall). |
Optional: S3 Object Storage
For multi-host deployments where workers don't share a filesystem, S3-compatible storage enables distributed file transfer (model files, configs):
| Flag | Env Var | Default | Description |
|---|---|---|---|
| `--storage-url` | `LOCALAI_STORAGE_URL` | (empty) | S3 endpoint URL (e.g., http://minio:9000) |
| `--storage-bucket` | `LOCALAI_STORAGE_BUCKET` | `localai` | S3 bucket name |
| `--storage-region` | `LOCALAI_STORAGE_REGION` | `us-east-1` | S3 region |
| `--storage-access-key` | `LOCALAI_STORAGE_ACCESS_KEY` | (empty) | S3 access key |
| `--storage-secret-key` | `LOCALAI_STORAGE_SECRET_KEY` | (empty) | S3 secret key |
When S3 is not configured, model files are transferred directly from the frontend to workers via HTTP — no shared filesystem needed. Each worker runs a small HTTP file transfer server alongside the gRPC backend process. This is the default and works out of the box.
For high-throughput or very large model files, S3 can be more efficient since it avoids streaming through the frontend.
Watching Backend Installs
While a worker downloads a backend, the admin Operations Bar at the top of the UI shows real-time progress: current file, downloaded/total bytes, and percentage. This works the same as single-node mode.
When an install targets more than one worker, an N nodes chevron appears on the operation row. Click it to expand a per-node breakdown, with one row per worker showing:
- A status pill: Queued (gray), Downloading (blue), Worker busy (yellow), Done (green), or Failed (red).
- The file currently being downloaded with current/total bytes and percentage.
- A thin per-node progress bar.
- Any error returned by the worker.
The yellow Worker busy pill means the worker took longer than
--backend-install-timeout to acknowledge but is most likely still
working in the background. The admin UI clears it as soon as the worker
finishes; no action is required from the operator.
If a worker is running an older LocalAI release that does not report progress, its row in the breakdown will still show terminal status (queued / done / failed / worker busy) but no per-file progress.
Worker Configuration
Workers are started with the worker subcommand. Each worker is generic — it doesn't need a backend type at startup:
    local-ai worker \
      --register-to http://frontend:8080 \
      --registration-token changeme \
      --nats-url nats://nats:4222
| Flag | Env Var | Default | Description |
|---|---|---|---|
| `--addr` | `LOCALAI_SERVE_ADDR` | `0.0.0.0:50051` | gRPC listen address |
| `--advertise-addr` | `LOCALAI_ADVERTISE_ADDR` | (auto) | Address the frontend uses to reach this node (see below) |
| `--http-addr` | `LOCALAI_HTTP_ADDR` | gRPC port - 1 | HTTP file transfer server bind address |
| `--advertise-http-addr` | `LOCALAI_ADVERTISE_HTTP_ADDR` | (auto) | HTTP address the frontend uses for file transfer |
| `--register-to` | `LOCALAI_REGISTER_TO` | (required) | Frontend URL for self-registration |
| `--node-name` | `LOCALAI_NODE_NAME` | hostname | Human-readable node name |
| `--registration-token` | `LOCALAI_REGISTRATION_TOKEN` | (empty) | Token to authenticate with the frontend |
| `--heartbeat-interval` | `LOCALAI_HEARTBEAT_INTERVAL` | `10s` | Interval between heartbeat pings |
| `--nats-url` | `LOCALAI_NATS_URL` | (required) | NATS URL for backend installation and file staging |
| `--backends-path` | `LOCALAI_BACKENDS_PATH` | `./backends` | Path to backend binaries |
| `--models-path` | `LOCALAI_MODELS_PATH` | `./models` | Path to model files |
{{% notice tip %}}
Advertise address: The --addr flag is the local bind address for gRPC. The --advertise-addr is the address the frontend stores and uses to reach the worker via gRPC. If not set, the worker auto-derives it by replacing 0.0.0.0 with the OS hostname (which in Docker is the container ID, resolvable via Docker DNS). Set --advertise-addr explicitly when the auto-detected hostname is not routable from the frontend (e.g., in Kubernetes, use the pod's service DNS name).
HTTP file transfer: Each worker also runs a small HTTP server for file transfer (model files, configs). By default it listens on the gRPC base port - 1 (e.g., if gRPC base is 50051, HTTP is on 50050). gRPC ports grow upward from the base port as additional models are loaded. Set --advertise-http-addr if the auto-detected address is not routable from the frontend.
{{% /notice %}}
Worker Address Configuration
The simplest way to configure a worker's network address is with a single variable:
| Variable | Description |
|---|---|
| `LOCALAI_ADDR` | Reachable address of this worker (host:port). The port is used as the base for gRPC backend processes, and port-1 for the HTTP file transfer server. |
Example:
    environment:
      LOCALAI_ADDR: "192.168.1.100:50051"
      LOCALAI_NATS_URL: "nats://frontend:4222"
      LOCALAI_REGISTER_TO: "http://frontend:8080"
      LOCALAI_REGISTRATION_TOKEN: "my-secret"
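To make the port layout concrete, here is a small Python sketch of how a single `LOCALAI_ADDR` value maps to the HTTP file transfer port (base port - 1) and to gRPC ports that grow upward from the base. The function name `derive_worker_ports` is hypothetical, and the assumption that ports are handed out strictly sequentially with no gaps is a simplification for illustration.

```python
def derive_worker_ports(addr: str, loaded_models: int = 0) -> dict:
    """Split host:port and derive the ports described for LOCALAI_ADDR."""
    host, _, port_text = addr.rpartition(":")
    base = int(port_text)
    return {
        "grpc_base": f"{host}:{base}",
        # The HTTP file transfer server sits one port below the gRPC base.
        "http": f"{host}:{base - 1}",
        # Assumption: each loaded model takes the next port, starting at the base.
        "grpc_ports": [base + i for i in range(loaded_models)],
    }
```

For `192.168.1.100:50051` with two models loaded, this yields HTTP on port 50050 and gRPC processes on ports 50051 and 50052.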
For advanced networking scenarios (NAT, load balancers, separate gRPC/HTTP ports), the following override variables are available:
| Variable | Description | Default |
|---|---|---|
| `LOCALAI_SERVE_ADDR` | gRPC base port bind address | `0.0.0.0:50051` |
| `LOCALAI_HTTP_ADDR` | HTTP file transfer bind address | 0.0.0.0:{gRPC port - 1} |
| `LOCALAI_ADVERTISE_ADDR` | Public gRPC address (if different from `LOCALAI_ADDR`) | Derived from `LOCALAI_ADDR` |
| `LOCALAI_ADVERTISE_HTTP_ADDR` | Public HTTP address (if different from gRPC host) | Derived from advertise host + HTTP port |
NVIDIA GPU support
When running workers in a container, two runtime settings affect how VRAM usage is reported back to the frontend:
- `NVIDIA_DRIVER_CAPABILITIES` must include `utility`. Without it, the NVML library (and therefore `nvidia-smi`) is not available inside the container. CUDA compute still works, but the worker cannot query free VRAM and the Nodes page will show the node as fully used. Set `NVIDIA_DRIVER_CAPABILITIES=compute,utility` (or, with the NVIDIA CDI runtime, list `capabilities: [gpu, utility]` on the device reservation).
- Run the container with `init: true` (or `docker run --init`). The worker process becomes PID 1 in the container and cannot reap zombies on its own. Without an init, `nvidia-smi` calls can fail intermittently with `waitid: no child processes`, which briefly clears free-VRAM metrics.
Unified memory devices (Jetson, DGX Spark / GB10, Thor): these SoCs
share one physical RAM between CPU and GPU. LocalAI detects them via
/sys/devices/soc0/family and /sys/devices/soc0/soc_id (no nvidia-smi
required) and reports system-RAM figures as VRAM. Free VRAM therefore tracks
MemAvailable in /proc/meminfo.
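As a rough illustration of what "tracks MemAvailable" means, the following Python sketch extracts that figure from `/proc/meminfo`-style text. It is not LocalAI's detection code; the function name `mem_available_bytes` is hypothetical, and the only thing relied on is the standard `MemAvailable:  <n> kB` line format.

```python
from typing import Optional


def mem_available_bytes(meminfo_text: str) -> Optional[int]:
    """Return MemAvailable in bytes, or None if the field is missing."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # Line looks like: "MemAvailable:   12345678 kB" (kB here means KiB).
            return int(line.split()[1]) * 1024
    return None
```

On a real system the input would come from reading `/proc/meminfo`; on a unified-memory device this is the number that would appear as free VRAM.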
Node Labels
Workers can declare labels at startup for scheduling constraints:
| Variable | Description | Example |
|---|---|---|
| `LOCALAI_NODE_LABELS` | Comma-separated `key=value` labels | `tier=premium,gpu=a100,zone=us-east` |
Labels can also be managed via the admin API (see Label Management API below).
The system automatically applies hardware-detected labels on registration:
- `gpu.vendor` -- GPU vendor (nvidia, amd, intel, vulkan)
- `gpu.vram` -- GPU VRAM bucket (8GB, 16GB, 24GB, 48GB, 80GB+)
- `node.name` -- The node's registered name
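The comma-separated `key=value` format used by `LOCALAI_NODE_LABELS` can be parsed roughly as follows. This is a sketch, not the worker's actual parser: the function name `parse_node_labels` is hypothetical, and trimming whitespace, skipping empty items, and rejecting entries without `=` are assumptions about reasonable behavior.

```python
def parse_node_labels(raw: str) -> dict:
    """Parse 'k1=v1,k2=v2' into a dict of labels."""
    labels = {}
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate trailing commas and empty input
        key, sep, value = item.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"invalid label {item!r}, expected key=value")
        labels[key.strip()] = value.strip()
    return labels
```

With the example value from the table, this returns `tier`, `gpu`, and `zone` keys that a scheduler could match against.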
How Workers Operate
Workers start as generic processes with no backend installed. When the SmartRouter needs to load a model on a worker, it sends a NATS backend.install event with the backend name and model ID. The worker:
- Installs the backend from the gallery (if not already installed)
- Starts a new gRPC backend process on a dynamic port (each model gets its own process)
- Replies with the allocated gRPC address
- The SmartRouter calls `LoadModel` via direct gRPC to that address
Workers can run multiple models concurrently — each model gets its own gRPC process on a separate port. For example, an embedding model on port 50051 and a chat model on port 50052 can run simultaneously on the same worker.
When the SmartRouter needs to free capacity, it can unload models with zero in-flight requests without affecting other models on the same worker.
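The worker-side steps can be modeled with a toy in-memory simulation. Nothing here talks to NATS or gRPC; the `WorkerSim` class and its method names are hypothetical, and returning the existing address for a model that is already running is an assumption made for the example.

```python
class WorkerSim:
    """Toy model of a generic worker handling install requests."""

    def __init__(self, host: str, base_port: int):
        self.host = host
        self.next_port = base_port
        self.installed = set()   # backends installed on this worker
        self.processes = {}      # model ID -> port of its gRPC process

    def handle_install(self, backend: str, model_id: str) -> str:
        # Install the backend only if it is not already present.
        if backend not in self.installed:
            self.installed.add(backend)  # stands in for a gallery install
        # Each model gets its own process on its own port.
        if model_id not in self.processes:
            self.processes[model_id] = self.next_port
            self.next_port += 1
        # Reply with the address the frontend should use for LoadModel.
        return f"{self.host}:{self.processes[model_id]}"

    def unload(self, model_id: str) -> None:
        # Stopping one model's process leaves the others untouched.
        self.processes.pop(model_id, None)
```

Two installs for different models on the same backend produce two addresses but only one backend install, which mirrors the embedding-plus-chat example above.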
Node Management API
The API is split into two prefixes with distinct auth:
/api/node/ — Node self-service
Used by workers themselves (registration, heartbeat, etc.). Authenticated via the registration token, exempt from global auth.
| Method | Path | Description |
|---|---|---|
| `POST` | `/api/node/register` | Register a new worker |
| `POST` | `/api/node/:id/heartbeat` | Update heartbeat timestamp |
| `POST` | `/api/node/:id/drain` | Mark self as draining |
| `GET` | `/api/node/:id/models` | Query own loaded models |
| `DELETE` | `/api/node/:id` | Deregister self |
/api/nodes/ — Admin management
Used by the WebUI and admin API consumers. Requires admin authentication.
| Method | Path | Description |
|---|---|---|
| `GET` | `/api/nodes` | List all registered workers |
| `GET` | `/api/nodes/:id` | Get a single worker by ID |
| `GET` | `/api/nodes/:id/models` | List models loaded on a worker |
| `DELETE` | `/api/nodes/:id` | Admin-delete a worker |
| `POST` | `/api/nodes/:id/drain` | Admin-drain a worker |
| `POST` | `/api/nodes/:id/approve` | Approve a pending worker node |
| `POST` | `/api/nodes/:id/backends/install` | Install a backend on a worker |
| `POST` | `/api/nodes/:id/backends/delete` | Delete a backend from a worker |
| `POST` | `/api/nodes/:id/models/unload` | Unload a model from a worker |
| `POST` | `/api/nodes/:id/models/delete` | Delete model files from a worker |
The Nodes page in the React WebUI provides a visual overview of all registered workers, their statuses, and loaded models.
Node Approval
By default, new worker nodes start in pending status and must be approved by an admin before they can receive traffic. This prevents unknown machines from joining the cluster.
To approve a pending node via the API:
curl -X POST http://frontend:8080/api/nodes/<node-id>/approve \
-H "Authorization: Bearer <admin-token>"
The Nodes page in the WebUI also shows pending nodes with an Approve button.
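For scripted approval, the same call can be built from Python's standard library. This is a minimal sketch: the helper name `build_approve_request` and the example node ID and token are placeholders, and the request is only constructed, not sent.

```python
import urllib.request


def build_approve_request(frontend_url, node_id, admin_token):
    """Build (but do not send) the approval request for a pending node."""
    url = f"{frontend_url.rstrip('/')}/api/nodes/{node_id}/approve"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"Bearer {admin_token}"},
    )


# "abc123" and "secret" are placeholder values for illustration.
req = build_approve_request("http://frontend:8080", "abc123", "secret")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```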
To skip manual approval and let nodes join immediately, set --auto-approve-nodes (or LOCALAI_AUTO_APPROVE_NODES=true) on the frontend. This is convenient for development and trusted environments.
Node Statuses
| Status | Meaning |
|---|---|
| pending | Node registered but waiting for admin approval (when --auto-approve-nodes is false) |
| healthy | Node is active and responding to heartbeats |
| unhealthy | Node has missed heartbeats beyond the threshold (detected by the HealthMonitor) |
| offline | Node is temporarily offline (graceful shutdown or stale heartbeat). The node row is preserved so re-registration restores the previous approval status without requiring re-approval |
| draining | Node is shutting down gracefully: no new requests are routed to it, and existing in-flight requests are allowed to complete |
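A compact way to read the statuses is to ask which of them may receive new requests. The sketch below encodes that reading; the `NodeStatus` enum and `accepts_new_requests` helper are illustrative names, and treating only `healthy` as routable is an interpretation of the descriptions above rather than a copy of the router's code.

```python
from enum import Enum


class NodeStatus(str, Enum):
    PENDING = "pending"
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    OFFLINE = "offline"
    DRAINING = "draining"


def accepts_new_requests(status):
    """Only healthy nodes are treated as routable in this sketch."""
    return NodeStatus(status) is NodeStatus.HEALTHY


for s in NodeStatus:
    print(f"{s.value}: {accepts_new_requests(s)}")
```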
Agent Workers
Agent workers are dedicated processes for executing agent chats and MCP CI jobs. Unlike backend workers (which run gRPC model inference), agent workers use cogito to orchestrate multi-step conversations with tool calls.
local-ai agent-worker \
--register-to http://frontend:8080 \
--nats-url nats://nats:4222 \
--registration-token changeme
Agent workers:
- Execute agent chat messages dispatched via NATS
- Run MCP CI jobs (with access to MCP servers via docker)
- Handle MCP tool discovery and execution requests from the frontend
- Get auto-provisioned API keys during registration for calling the inference API
In the docker-compose setup, the agent worker mounts the Docker socket so it can run MCP stdio servers (e.g., docker run commands):
agent-worker-1:
  command: agent-worker
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
MCP in Distributed Mode
MCP servers configured in model configs work in distributed mode. The frontend routes MCP operations through NATS to agent workers:
- MCP discovery (GET /v1/mcp/servers/:model): routed to agent workers which create sessions and return server info
- MCP tool execution (during /v1/chat/completions): tool calls are routed to agent workers via NATS request-reply
- MCP CI jobs: executed entirely on agent workers with access to docker for stdio-based MCP servers
vLLM Multi-Node (Data-Parallel)
A single vLLM model can span multiple GPU nodes via data parallelism: the head node serves the OpenAI API and runs the local DP ranks, while follower nodes run vanilla vllm serve --headless and speak ZMQ directly to the head. LocalAI's role is starting the follower processes and surfacing them in the admin UI; the cross-rank tensor traffic is vLLM's own.
This mode is operator-launched — the head config and each follower's invocation must agree on the topology (data_parallel_size, data_parallel_size_local, data_parallel_address, data_parallel_rpc_port). The SmartRouter does not place follower ranks automatically.
Head node configuration
The head runs the existing single-node vLLM gRPC backend. Set engine_args to publish the DP topology vLLM expects:
backend: vllm
parameters:
  model: moonshotai/Kimi-K2.6-Instruct
engine_args:
  data_parallel_size: 4            # total ranks across all nodes
  data_parallel_size_local: 2      # ranks on the head node
  data_parallel_address: 10.0.0.1  # head's reachable IP
  data_parallel_rpc_port: 32100    # any free port; followers connect here
  enable_expert_parallel: true     # for MoE models
The head will start its 2 local ranks, listen on 10.0.0.1:32100, and wait for the remaining 2 ranks to handshake.
Follower nodes
Each follower runs local-ai p2p-worker vllm with matching topology, an explicit start rank, and the head's address:
local-ai p2p-worker vllm \
moonshotai/Kimi-K2.6-Instruct \
--data-parallel-size 4 \
--data-parallel-size-local 2 \
--start-rank 2 \
--master-addr 10.0.0.1 \
--master-port 32100 \
--register-to http://frontend:8080 \
--registration-token changeme
--register-to is optional but recommended — it makes the follower visible in the admin UI as an agent-type node tagged with node.role=vllm-follower. Without it the worker just runs vLLM and exits silently when vLLM does. The role label discourages SmartRouter from placing other models on the follower; pair it with model selectors like {"!node.role":"vllm-follower"} if you also run regular LocalAI models on the same fleet.
Worked example: 2-node Kimi-K2.6 deployment
Two A100 nodes (10.0.0.1, 10.0.0.2), 8 GPUs total, data_parallel_size=8 with 4 ranks per node:
# /models/kimi.yaml on the head (10.0.0.1)
name: kimi-k2-6
backend: vllm
parameters:
  model: moonshotai/Kimi-K2.6-Instruct
engine_args:
  data_parallel_size: 8
  data_parallel_size_local: 4
  data_parallel_address: 10.0.0.1
  data_parallel_rpc_port: 32100
  enable_expert_parallel: true
  all2all_backend: deepep_high_throughput
# On 10.0.0.2 (follower)
local-ai p2p-worker vllm moonshotai/Kimi-K2.6-Instruct \
--data-parallel-size 8 --data-parallel-size-local 4 --start-rank 4 \
--master-addr 10.0.0.1 --master-port 32100 \
--register-to http://10.0.0.1:8080 --registration-token changeme
A curl http://10.0.0.1:8080/v1/chat/completions ... against the head will then dispatch across all 8 ranks.
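The --start-rank values in both examples follow a simple pattern: the head owns ranks 0 to data_parallel_size_local - 1, and each follower starts where the previous node ended. The sketch below computes them under the assumption that every node hosts the same number of local ranks; `follower_start_ranks` is an illustrative helper, not part of LocalAI or vLLM.

```python
def follower_start_ranks(dp_size, dp_size_local):
    """Return the --start-rank for each follower node, in order.

    Assumes every node hosts dp_size_local ranks and the head owns
    ranks 0 .. dp_size_local - 1.
    """
    if dp_size_local <= 0 or dp_size % dp_size_local != 0:
        raise ValueError(
            "data_parallel_size must be a positive multiple of "
            "data_parallel_size_local"
        )
    nodes = dp_size // dp_size_local
    return [node * dp_size_local for node in range(1, nodes)]


print(follower_start_ranks(8, 4))  # worked example: one follower at rank 4
print(follower_start_ranks(4, 2))  # earlier example: one follower at rank 2
```

With unequal rank counts per node the start ranks have to be assigned by hand, since the head and followers only agree on the total.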
Intel Arc / XPU notes
vLLM XPU supports DP (vllm/platforms/xpu.py:198 handles world_size_across_dp > 1; ranks bind to xpu:{local_rank} in xpu_worker.py:62, with xccl as the collective backend). Each rank still needs a distinct discrete GPU — the iGPU on a hybrid host is not a viable second device.
Older XE-HPG GPUs (e.g. Arc A770) need to bypass the cutlass attention path:
engine_args:
  attention_backend: TRITON_ATTN
docker-compose.vllm-multinode.intel.yaml at the repo root is the Intel equivalent of docker-compose.vllm-multinode.yaml. It uses /dev/dri passthrough, ZE_AFFINITY_MASK to pin each rank to one device, and latest-gpu-intel images. Run it via ./tests/e2e/vllm-multinode/smoke.sh --intel.
Caveats
- Tensor parallel within a node only. vLLM v1 does not support TP across nodes; combine tensor_parallel_size (within a node, via engine_args) with data_parallel_size (across nodes).
- Followers don't host LocalAI gRPC. The follower process is vanilla vLLM, so /api/backend-logs/<modelId> does not stream follower output. Use journalctl / kubectl logs / compose logs for the follower's stderr.
- Network reachability. The head's data_parallel_rpc_port plus a range of ZMQ ports (typically data_parallel_rpc_port..+N) must be reachable from every follower. Open them in your firewall / security group.
- Topology must match exactly. A mismatch in --data-parallel-size between head and any follower will hang the handshake. Check the head's vLLM logs for waiting for N DP ranks if startup stalls.
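Because a topology mismatch shows up only as a stalled handshake, it can help to compare the head's engine_args with each follower's flags before launching. The following is a rough pre-flight sketch: `topology_mismatches` and the plain dictionaries it takes are assumptions for illustration, and the pairing of data_parallel_address / data_parallel_rpc_port with --master-addr / --master-port is inferred from the examples above.

```python
# Head engine_args key -> follower CLI flag (pairing inferred from the examples).
PAIRS = {
    "data_parallel_size": "data-parallel-size",
    "data_parallel_size_local": "data-parallel-size-local",
    "data_parallel_address": "master-addr",
    "data_parallel_rpc_port": "master-port",
}


def topology_mismatches(head_engine_args, follower_flags):
    """Return one human-readable line per setting that disagrees."""
    problems = []
    for head_key, flag in PAIRS.items():
        head_value = head_engine_args.get(head_key)
        follower_value = follower_flags.get(flag)
        # Compare as strings so 32100 and "32100" count as equal.
        if str(head_value) != str(follower_value):
            problems.append(
                f"{head_key}={head_value!r} vs --{flag}={follower_value!r}"
            )
    return problems


head = {
    "data_parallel_size": 8,
    "data_parallel_size_local": 4,
    "data_parallel_address": "10.0.0.1",
    "data_parallel_rpc_port": 32100,
}
follower = {
    "data-parallel-size": "4",  # deliberately wrong
    "data-parallel-size-local": "4",
    "master-addr": "10.0.0.1",
    "master-port": "32100",
}
print(topology_mismatches(head, follower))
```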
Scaling
Adding worker capacity: Start additional worker instances pointing to the same frontend. They self-register automatically:
# Additional workers — no backend type needed
local-ai worker \
--register-to http://frontend:8080 \
--node-name worker-2 \
--nats-url nats://nats:4222 \
--registration-token changeme
local-ai worker \
--register-to http://frontend:8080 \
--node-name worker-3 \
--nats-url nats://nats:4222 \
--registration-token changeme
Multiple frontend replicas: Run multiple LocalAI frontends behind a load balancer. Since all state is in PostgreSQL and coordination is via NATS, frontends are fully stateless and interchangeable.
Model Scheduling
Model scheduling controls where models are placed and how many replicas are maintained. It combines two optional features:
Node Selectors
Pin models to nodes with specific labels. Only nodes matching all selector labels are eligible:
# Only schedule on NVIDIA nodes in the us-east zone
curl -X POST http://frontend:8080/api/nodes/scheduling \
-H "Content-Type: application/json" \
-d '{"model_name": "llama3", "node_selector": {"gpu.vendor": "nvidia", "zone": "us-east"}}'
Without a node selector, models can schedule on any healthy node (default behavior).
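The matching rule is an all-of comparison: a node is eligible only if every key in the selector is present on the node with the same value, and an empty selector matches everything. A minimal sketch follows; `matches_selector` is an illustrative name, and it handles only exact matches, not negated keys.

```python
def matches_selector(node_labels, node_selector):
    """True when the node carries every label in the selector."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


nodes = {
    "worker-1": {"gpu.vendor": "nvidia", "zone": "us-east"},
    "worker-2": {"gpu.vendor": "nvidia", "zone": "eu-west"},
    "worker-3": {"gpu.vendor": "amd", "zone": "us-east"},
}
selector = {"gpu.vendor": "nvidia", "zone": "us-east"}
eligible = [name for name, labels in nodes.items() if matches_selector(labels, selector)]
print(eligible)
```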
Replica Auto-Scaling
Control the number of model replicas across the cluster:
| Field | Description |
|---|---|
| min_replicas | Minimum replicas to maintain (0 = no minimum, single replica default) |
| max_replicas | Maximum replicas allowed (0 = unlimited) |
Auto-scaling is only active when min_replicas > 0 or max_replicas > 0.
# Scale llama3 between 2 and 4 replicas on NVIDIA nodes
curl -X POST http://frontend:8080/api/nodes/scheduling \
-H "Content-Type: application/json" \
-d '{
"model_name": "llama3",
"node_selector": {"gpu.vendor": "nvidia"},
"min_replicas": 2,
"max_replicas": 4
}'
The Replica Reconciler runs as a background process on the frontend:
- Scale up: Adds replicas when all existing replicas are busy (have in-flight requests)
- Scale down: Removes idle replicas after 5 minutes of inactivity
- Maintain minimum: Ensures min_replicas are always loaded (recovers from node failures)
- Eviction protection: Models with auto-scaling enabled are never evicted below min_replicas
All fields are optional and composable:
- Node selector only: pin model to matching nodes, single replica
- Replicas only: auto-scale across all nodes
- Both: auto-scale on matching nodes only
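The reconciler rules can be read as a single decision function per model. The sketch below is one possible reading, not the actual implementation: `reconcile`, the tuple shape of `replicas`, and the order in which the rules are checked are all assumptions; only the conditions themselves (all replicas busy, 5 minutes idle, min/max bounds, 0 meaning unset) come from the text.

```python
IDLE_TIMEOUT_S = 5 * 60  # "after 5 minutes of inactivity"


def reconcile(replicas, min_replicas=0, max_replicas=0):
    """Decide one action for a model.

    replicas: list of (in_flight, idle_seconds) tuples, one per loaded replica.
    """
    if min_replicas == 0 and max_replicas == 0:
        return "inactive"  # auto-scaling is off for this model
    count = len(replicas)
    if count < min_replicas:
        return "scale_up"  # maintain minimum
    below_max = max_replicas == 0 or count < max_replicas
    if count > 0 and below_max and all(in_flight > 0 for in_flight, _ in replicas):
        return "scale_up"  # every replica is busy
    has_idle = any(
        in_flight == 0 and idle >= IDLE_TIMEOUT_S for in_flight, idle in replicas
    )
    if has_idle and count > min_replicas:
        return "scale_down"  # never drop below min_replicas
    return "hold"


print(reconcile([(1, 0), (2, 0)], min_replicas=2, max_replicas=4))
print(reconcile([(1, 0), (1, 0), (0, 400)], min_replicas=2, max_replicas=4))
```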
Label Management API
| Method | Path | Description |
|---|---|---|
| GET | /api/nodes/:id/labels | Get labels for a node |
| PUT | /api/nodes/:id/labels | Replace all labels (JSON object) |
| PATCH | /api/nodes/:id/labels | Merge labels (add/update) |
| DELETE | /api/nodes/:id/labels/:key | Remove a single label |
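The difference between PUT, PATCH, and DELETE on labels is easiest to see on a plain dictionary. The helpers below only model the semantics described in the table (replace, merge, remove one key); their names are illustrative and they do not call the API.

```python
def put_labels(current, new):
    """PUT: the new object replaces all existing labels."""
    return dict(new)


def patch_labels(current, updates):
    """PATCH: add new keys and overwrite existing ones, keep the rest."""
    return {**current, **updates}


def delete_label(current, key):
    """DELETE /labels/:key: remove a single label if present."""
    return {k: v for k, v in current.items() if k != key}


labels = {"gpu.vendor": "nvidia", "zone": "us-east"}
print(put_labels(labels, {"zone": "eu-west"}))
print(patch_labels(labels, {"zone": "eu-west"}))
print(delete_label(labels, "zone"))
```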
Scheduling API
| Method | Path | Description |
|---|---|---|
| GET | /api/nodes/scheduling | List all scheduling configs |
| GET | /api/nodes/scheduling/:model | Get config for a model |
| POST | /api/nodes/scheduling | Create/update config |
| DELETE | /api/nodes/scheduling/:model | Remove config |
Comparison with P2P
| | P2P / Federation | Distributed Mode |
|---|---|---|
| Discovery | Automatic via libp2p token | Self-registration to frontend URL |
| State storage | In-memory / ledger | PostgreSQL |
| Coordination | Gossip protocol | NATS messaging |
| Node management | Automatic | REST API + WebUI |
| Health monitoring | Peer heartbeats | Centralized HealthMonitor |
| Backend management | Manual per node | Dynamic via NATS backend.install |
| Best for | Ad-hoc clusters, community sharing | Production, Kubernetes, managed infrastructure |
| Setup complexity | Minimal (share a token) | Requires PostgreSQL + NATS |
Troubleshooting
Worker not registering:
- Verify the frontend URL is reachable from the worker (curl http://frontend:8080/api/node/register)
- Check that --registration-token matches on both frontend and worker
- Ensure auth is enabled on the frontend (LOCALAI_AUTH=true)
NATS connection errors:
- Confirm NATS is running and reachable (nats-server --signal ldm or check port 4222)
- Check that --nats-url uses the correct hostname/IP from the worker's network perspective
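A plain TCP connect from the worker's host is a quick way to check that port 4222 is reachable from the worker's network perspective. The helper below is a generic sketch using only the standard library; `tcp_reachable` is an illustrative name, and a successful connect only shows that something is listening, not that it is a healthy NATS server.

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


# Example, run from the worker's host or container:
#   tcp_reachable("nats", 4222)
```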
PostgreSQL connection errors:
- Verify the connection URL format: postgresql://user:password@host:5432/dbname?sslmode=disable
- Ensure the database exists and the user has CREATE TABLE permissions (for auto-migration)
- Check that pgvector extension is installed if using RAG features
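To spot a malformed connection URL before starting the frontend, the string can be split into its parts with the standard library. This is a small sketch; `describe_pg_url` is an illustrative helper, it only parses the URL and does not connect to the database, and it deliberately leaves the password out of its result.

```python
from urllib.parse import parse_qs, urlsplit


def describe_pg_url(url):
    """Split a PostgreSQL connection URL into its components."""
    parts = urlsplit(url)
    if parts.scheme not in ("postgresql", "postgres"):
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    return {
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "dbname": parts.path.lstrip("/"),
        "sslmode": parse_qs(parts.query).get("sslmode", [None])[0],
    }


info = describe_pg_url("postgresql://user:password@host:5432/dbname?sslmode=disable")
print(info)
```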
Node shows as unhealthy or offline:
- The HealthMonitor marks nodes offline when heartbeats are missed. Check network connectivity between worker and frontend.
- Verify --heartbeat-interval is not set too high
- Offline nodes automatically restore to healthy when they re-register (no re-approval needed)
Backend not installing:
- Check the worker logs for backend.install events
- Verify the backend gallery configuration is correct
- The worker needs network access to download backends from the gallery
Port conflicts on workers:
- Each model gets its own gRPC process on an incrementing port (50051, 50052, ...)
- The HTTP file transfer server runs on the base port minus 1 (default: 50050)
- Ensure the port range is not blocked by firewalls or used by other services
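The port layout described under "Port conflicts on workers" can be written out to see which ports a firewall rule has to cover. The sketch below assumes ports are handed out strictly in load order starting at the base port; `port_plan` is an illustrative helper, and the real allocator may behave differently when models are unloaded and ports are freed.

```python
def port_plan(model_names, base_port=50051):
    """Map each model to a gRPC port and report the file transfer port."""
    models = {name: base_port + index for index, name in enumerate(model_names)}
    return {"file_transfer": base_port - 1, "models": models}


plan = port_plan(["embeddings", "chat"])
print(plan)
```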