feat(concurrency-groups): per-model exclusive groups for backend loading (#9662)

* feat(concurrency-groups): per-model exclusive groups for backend loading

Adds `concurrency_groups: [...]` to model YAML configs. Two models that share
a group cannot be loaded concurrently on the same node — loading one evicts
the others, reusing the existing pinned/busy/retry policy from LRU eviction.

Layered design:
- Watchdog (pkg/model): per-node correctness floor — on every Load(), evict
  any loaded model that shares a group with the requested one. Pinned skips
  surface NeedMore so the loader retries (and ultimately logs a clear
  warning), instead of silently allowing the rule to be violated.
- Distributed scheduler (core/services/nodes): soft anti-affinity hint —
  scheduleNewModel prefers nodes that don't already host a same-group
  model, falling back to eviction only if every candidate has a conflict.
  Composes with NodeSelector at the same point in the candidate pipeline.
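The watchdog-level rule sketched above can be illustrated roughly as follows. This is a minimal sketch of the semantics only; the type and function names (`model`, `sharesGroup`, `evictConflicts`) are illustrative, not the actual `pkg/model` symbols.

```go
package main

import "fmt"

type model struct {
	name   string
	groups []string
	pinned bool
}

// sharesGroup reports whether two models declare at least one common
// concurrency group.
func sharesGroup(a, b model) bool {
	for _, ga := range a.groups {
		for _, gb := range b.groups {
			if ga == gb {
				return true
			}
		}
	}
	return false
}

// evictConflicts returns the models surviving a Load() of req, plus the
// names of pinned conflicting models that blocked full enforcement
// (the case surfaced as NeedMore so the loader retries).
func evictConflicts(loaded []model, req model) (kept []model, blocked []string) {
	for _, m := range loaded {
		if sharesGroup(m, req) {
			if m.pinned {
				blocked = append(blocked, m.name) // cannot evict; report upstream
				kept = append(kept, m)
			}
			continue // unpinned conflict: evicted
		}
		kept = append(kept, m)
	}
	return kept, blocked
}

func main() {
	loaded := []model{
		{name: "llama-120b-a", groups: []string{"vram-heavy"}},
		{name: "zed-predict"},
	}
	kept, blocked := evictConflicts(loaded, model{name: "llama-120b-b", groups: []string{"vram-heavy"}})
	fmt.Println(len(kept), len(blocked)) // only zed-predict survives; nothing pinned
}
```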

Per-node, not cluster-wide: VRAM is a node-local resource, and two heavy
models running on different nodes is fine. The ConfigLoader is wired into
SmartRouter via a small ConcurrencyConflictResolver interface so the nodes
package keeps a narrow surface on core/config.

Refactors the inner LRU eviction body into a shared collectEvictionsLocked
helper and the loader retry loop into retryEnforce(fn, maxRetries, interval),
so both LRU and group enforcement share busy/pinned/retry semantics.

Closes #9659.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(watchdog): sync pinned + concurrency_groups at startup

The startup-time watchdog setup lives in initializeWatchdog (startup.go),
not in startWatchdog (watchdog.go). The latter is only invoked from the
runtime-settings RestartWatchdog path. As a result, neither
SyncPinnedModelsToWatchdog nor SyncModelGroupsToWatchdog ran at boot,
so `pinned: true` and `concurrency_groups: [...]` only became effective
after a settings-driven watchdog restart.

Fix by adding both sync calls to initializeWatchdog. Confirmed end-to-end:
loading model A in group "heavy", then C with no group (coexists),
then B in group "heavy" now correctly evicts A and leaves [B, C].

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(test): satisfy errcheck on new os.Remove in concurrency_groups spec

CI lint runs errcheck only on changes since the merge base, so the
pre-existing `defer os.Remove(tmp.Name())` lines are grandfathered, but
the one introduced by the concurrency_groups YAML round-trip test is
held to errcheck. Wrap the remove in a closure that discards the error.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Author: Ettore Di Giacinto
Date: 2026-05-05 08:42:50 +02:00 (committed by GitHub)
Parent: 22ae415695
Commit: bbcaebc1ef
17 changed files with 981 additions and 76 deletions


@@ -8,7 +8,8 @@ url = '/advanced/vram-management'
When running multiple models in LocalAI, especially on systems with limited GPU memory (VRAM), you may encounter situations where loading a new model fails because there isn't enough available VRAM. LocalAI provides several mechanisms to automatically manage model memory allocation and prevent VRAM exhaustion:
1. **Max Active Backends (LRU Eviction)**: Limit the number of loaded models, evicting the least recently used when the limit is reached
2. **Concurrency Groups**: Per-model anti-affinity rules that prevent specific models from coexisting on the same node
3. **Watchdog Mechanisms**: Automatically unload idle or stuck models based on configurable timeouts
## The Problem
@@ -136,6 +137,86 @@ LOCALAI_SINGLE_ACTIVE_BACKEND=true ./local-ai
- When you only need one model active at a time
- Simple deployments where model switching is acceptable
## Solution 1b: Concurrency Groups (per-model anti-affinity)
`--max-active-backends` is a global count — three loaded models is fine, but it
doesn't know that two of them are 120B and shouldn't share a GPU.
**Concurrency groups** give per-model rules: any two models that share a group
name are mutually exclusive on the same node. Loading one evicts the others.
Models with no groups behave exactly as before.
This addresses [issue #9659](https://github.com/mudler/LocalAI/issues/9659):
> allow my zed prediction model to run alongside anything but don't allow my
> two 120b models to run alongside each other
### Configuration
Declare groups per model in the YAML config — no CLI flag, no env var:
```yaml
# llama-120b-a.yaml
name: llama-120b-a
backend: llama-cpp
parameters:
  model: llama-120b-a.gguf
concurrency_groups: ["vram-heavy"]
```
```yaml
# llama-120b-b.yaml
name: llama-120b-b
backend: llama-cpp
parameters:
  model: llama-120b-b.gguf
concurrency_groups: ["vram-heavy"]
```
```yaml
# zed-predict.yaml — no groups, runs alongside anything
name: zed-predict
backend: llama-cpp
parameters:
  model: zed-predict.gguf
```
With this configuration:
1. Request `zed-predict` → loads.
2. Request `llama-120b-a` → loads alongside `zed-predict`.
3. Request `llama-120b-b` → `llama-120b-a` is evicted (shared group
`vram-heavy`); `zed-predict` stays loaded.
A model can declare multiple groups; two models conflict if they share **any**
group name. Group names are arbitrary strings — pick names that make sense for
your hardware (`vram-heavy`, `gpu-1`, `large-context`, ...).
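The "conflict if any group is shared" rule is just a set intersection. A small Go sketch of the predicate (illustrative only, not the actual LocalAI implementation):

```go
package main

import "fmt"

// conflicts reports whether two models are mutually exclusive: true if
// their concurrency_groups share at least one name. A model with no
// groups never conflicts with anything.
func conflicts(a, b []string) bool {
	seen := map[string]bool{}
	for _, g := range a {
		seen[g] = true
	}
	for _, g := range b {
		if seen[g] {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(conflicts([]string{"vram-heavy", "gpu-1"}, []string{"vram-heavy"})) // true: shared group
	fmt.Println(conflicts([]string{"vram-heavy"}, nil))                             // false: no groups, coexists
}
```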
### Interaction with other knobs
- **`--max-active-backends`**: groups are checked *before* the LRU cap. Group
evictions may already make room; LRU then enforces the global count.
- **`pinned: true`**: a pinned model is never evicted, including by a group
conflict. The new request is loaded with a warning logged — pinning two
models in the same group is a configuration mismatch.
- **`--force-eviction-when-busy`**: same retry semantics as LRU. A busy
conflict is skipped and retried (`--lru-eviction-max-retries`,
`--lru-eviction-retry-interval`); after retries exhaust, the load proceeds
with a warning.
### Distributed mode
`concurrency_groups` is enforced **per node**, not cluster-wide — VRAM is a
node-local resource, so two heavy models on different nodes is fine. The
distributed scheduler additionally uses the rule as a placement hint: when
choosing where to load a new model, it prefers nodes that don't already host a
same-group model, falling back to eviction only if every candidate has a
conflict.
`concurrency_groups` composes with `NodeSelector` (which decides *which
nodes* a model is eligible for) — the two filters apply in sequence. Use
`NodeSelector` to target hardware classes; use `concurrency_groups` to keep
specific models from co-residing on whichever node hosts them.
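The scheduler's preference described above can be sketched as a candidate filter. Everything here (the `node` type, `preferNonConflicting`) is an illustrative assumption about the shape of the logic, not the actual `core/services/nodes` API:

```go
package main

import "fmt"

type node struct {
	name         string
	hostedGroups map[string]bool // groups of models already loaded on this node
}

func hasConflict(n node, groups []string) bool {
	for _, g := range groups {
		if n.hostedGroups[g] {
			return true
		}
	}
	return false
}

// preferNonConflicting keeps only candidates hosting no same-group
// model; if every candidate conflicts, it returns the original slate so
// the scheduler can fall back to eviction on a conflicting node.
func preferNonConflicting(candidates []node, groups []string) []node {
	var clean []node
	for _, n := range candidates {
		if !hasConflict(n, groups) {
			clean = append(clean, n)
		}
	}
	if len(clean) == 0 {
		return candidates // fallback: eviction path
	}
	return clean
}

func main() {
	nodes := []node{
		{name: "node-a", hostedGroups: map[string]bool{"vram-heavy": true}},
		{name: "node-b", hostedGroups: map[string]bool{}},
	}
	picked := preferNonConflicting(nodes, []string{"vram-heavy"})
	fmt.Println(picked[0].name) // node-b preferred: no same-group model hosted
}
```

A `NodeSelector`-style eligibility filter would simply run before this step, narrowing `candidates` first.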
## Solution 2: Watchdog Mechanisms
For more flexible memory management, LocalAI provides watchdog mechanisms that automatically unload models based on their activity state. This allows multiple models to be loaded simultaneously, but automatically frees memory when models become inactive or stuck.