feat(concurrency-groups): per-model exclusive groups for backend loading (#9662)

* feat(concurrency-groups): per-model exclusive groups for backend loading

Adds `concurrency_groups: [...]` to model YAML configs. Two models that share
a group cannot be loaded concurrently on the same node — loading one evicts
the others, reusing the existing pinned/busy/retry policy from LRU eviction.

Layered design:
- Watchdog (pkg/model): per-node correctness floor — on every Load(), evict
  any loaded model that shares a group with the requested one. Pinned skips
  surface NeedMore so the loader retries (and ultimately logs a clear
  warning), instead of silently allowing the rule to be violated.
- Distributed scheduler (core/services/nodes): soft anti-affinity hint —
  scheduleNewModel prefers nodes that don't already host a same-group
  model, falling back to eviction only if every candidate has a conflict.
  Composes with NodeSelector at the same point in the candidate pipeline.
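The watchdog-level rule sketched above can be illustrated roughly as follows. This is a minimal sketch of the semantics only; the type and function names (`model`, `sharesGroup`, `evictConflicts`) are illustrative, not the actual `pkg/model` symbols.

```go
package main

import "fmt"

type model struct {
	name   string
	groups []string
	pinned bool
}

// sharesGroup reports whether two models declare at least one common
// concurrency group.
func sharesGroup(a, b model) bool {
	for _, ga := range a.groups {
		for _, gb := range b.groups {
			if ga == gb {
				return true
			}
		}
	}
	return false
}

// evictConflicts returns the models surviving a Load() of req, plus the
// names of pinned conflicting models that blocked full enforcement
// (the case surfaced as NeedMore so the loader retries).
func evictConflicts(loaded []model, req model) (kept []model, blocked []string) {
	for _, m := range loaded {
		if sharesGroup(m, req) {
			if m.pinned {
				blocked = append(blocked, m.name) // cannot evict; report upstream
				kept = append(kept, m)
			}
			continue // unpinned conflict: evicted
		}
		kept = append(kept, m)
	}
	return kept, blocked
}

func main() {
	loaded := []model{
		{name: "llama-120b-a", groups: []string{"vram-heavy"}},
		{name: "zed-predict"},
	}
	kept, blocked := evictConflicts(loaded, model{name: "llama-120b-b", groups: []string{"vram-heavy"}})
	fmt.Println(len(kept), len(blocked)) // only zed-predict survives; nothing pinned
}
```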

Per-node, not cluster-wide: VRAM is a node-local resource, and two heavy
models running on different nodes is fine. The ConfigLoader is wired into
SmartRouter via a small ConcurrencyConflictResolver interface so the nodes
package keeps a narrow surface on core/config.

Refactors the inner LRU eviction body into a shared collectEvictionsLocked
helper and the loader retry loop into retryEnforce(fn, maxRetries, interval),
so both LRU and group enforcement share busy/pinned/retry semantics.

Closes #9659.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(watchdog): sync pinned + concurrency_groups at startup

The startup-time watchdog setup lives in initializeWatchdog (startup.go),
not in startWatchdog (watchdog.go). The latter is only invoked from the
runtime-settings RestartWatchdog path. As a result, neither
SyncPinnedModelsToWatchdog nor SyncModelGroupsToWatchdog ran at boot,
so `pinned: true` and `concurrency_groups: [...]` only became effective
after a settings-driven watchdog restart.

Fix by adding both sync calls to initializeWatchdog. Confirmed end-to-end:
loading model A in group "heavy", then C with no group (coexists),
then B in group "heavy" now correctly evicts A and leaves [B, C].

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(test): satisfy errcheck on new os.Remove in concurrency_groups spec

CI lint runs errcheck only on changes since the merge base, so the
pre-existing `defer os.Remove(tmp.Name())` lines are grandfathered, but
the one introduced by the concurrency_groups YAML round-trip test is
held to errcheck. Wrap the remove in a closure that discards the error.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Author: Ettore Di Giacinto
Date: 2026-05-05 08:42:50 +02:00 (committed by GitHub)
Parent: 22ae415695
Commit: bbcaebc1ef
17 changed files with 981 additions and 76 deletions


@@ -8,7 +8,8 @@ url = '/advanced/vram-management'
When running multiple models in LocalAI, especially on systems with limited GPU memory (VRAM), you may encounter situations where loading a new model fails because there isn't enough available VRAM. LocalAI provides several mechanisms to automatically manage model memory allocation and prevent VRAM exhaustion:
1. **Max Active Backends (LRU Eviction)**: Limit the number of loaded models, evicting the least recently used when the limit is reached
2. **Concurrency Groups**: Per-model anti-affinity rules that prevent specific models from coexisting on the same node
3. **Watchdog Mechanisms**: Automatically unload idle or stuck models based on configurable timeouts
## The Problem
@@ -136,6 +137,86 @@ LOCALAI_SINGLE_ACTIVE_BACKEND=true ./local-ai
- When you only need one model active at a time
- Simple deployments where model switching is acceptable
## Solution 1b: Concurrency Groups (per-model anti-affinity)
`--max-active-backends` is a global count — three loaded models is fine, but it
doesn't know that two of them are 120B and shouldn't share a GPU.
**Concurrency groups** give per-model rules: any two models that share a group
name are mutually exclusive on the same node. Loading one evicts the others.
Models with no groups behave exactly as before.
This addresses [issue #9659](https://github.com/mudler/LocalAI/issues/9659):
> allow my zed prediction model to run alongside anything but don't allow my
> two 120b models to run alongside each other
### Configuration
Declare groups per model in the YAML config — no CLI flag, no env var:
```yaml
# llama-120b-a.yaml
name: llama-120b-a
backend: llama-cpp
parameters:
  model: llama-120b-a.gguf
concurrency_groups: ["vram-heavy"]
```
```yaml
# llama-120b-b.yaml
name: llama-120b-b
backend: llama-cpp
parameters:
  model: llama-120b-b.gguf
concurrency_groups: ["vram-heavy"]
```
```yaml
# zed-predict.yaml — no groups, runs alongside anything
name: zed-predict
backend: llama-cpp
parameters:
  model: zed-predict.gguf
```
With this configuration:
1. Request `zed-predict` → loads.
2. Request `llama-120b-a` → loads alongside `zed-predict`.
3. Request `llama-120b-b` → `llama-120b-a` is evicted (shared group
`vram-heavy`); `zed-predict` stays loaded.
A model can declare multiple groups; two models conflict if they share **any**
group name. Group names are arbitrary strings — pick names that make sense for
your hardware (`vram-heavy`, `gpu-1`, `large-context`, ...).
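The "conflict if any group is shared" rule is just a set intersection. A small Go sketch of the predicate (illustrative only, not the actual LocalAI implementation):

```go
package main

import "fmt"

// conflicts reports whether two models are mutually exclusive: true if
// their concurrency_groups share at least one name. A model with no
// groups never conflicts with anything.
func conflicts(a, b []string) bool {
	seen := map[string]bool{}
	for _, g := range a {
		seen[g] = true
	}
	for _, g := range b {
		if seen[g] {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(conflicts([]string{"vram-heavy", "gpu-1"}, []string{"vram-heavy"})) // true: shared group
	fmt.Println(conflicts([]string{"vram-heavy"}, nil))                             // false: no groups, coexists
}
```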
### Interaction with other knobs
- **`--max-active-backends`**: groups are checked *before* the LRU cap. Group
evictions may already make room; LRU then enforces the global count.
- **`pinned: true`**: a pinned model is never evicted, including by a group
conflict. The new request is loaded with a warning logged — pinning two
models in the same group is a configuration mismatch.
- **`--force-eviction-when-busy`**: same retry semantics as LRU. A busy
conflict is skipped and retried (`--lru-eviction-max-retries`,
`--lru-eviction-retry-interval`); after retries exhaust, the load proceeds
with a warning.
### Distributed mode
`concurrency_groups` is enforced **per node**, not cluster-wide — VRAM is a
node-local resource, so two heavy models on different nodes is fine. The
distributed scheduler additionally uses the rule as a placement hint: when
choosing where to load a new model, it prefers nodes that don't already host a
same-group model, falling back to eviction only if every candidate has a
conflict.
`concurrency_groups` composes with `NodeSelector` (which decides *which
nodes* a model is eligible for) — the two filters apply in sequence. Use
`NodeSelector` to target hardware classes; use `concurrency_groups` to keep
specific models from co-residing on whichever node hosts them.
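The scheduler's preference described above can be sketched as a candidate filter. Everything here (the `node` type, `preferNonConflicting`) is an illustrative assumption about the shape of the logic, not the actual `core/services/nodes` API:

```go
package main

import "fmt"

type node struct {
	name         string
	hostedGroups map[string]bool // groups of models already loaded on this node
}

func hasConflict(n node, groups []string) bool {
	for _, g := range groups {
		if n.hostedGroups[g] {
			return true
		}
	}
	return false
}

// preferNonConflicting keeps only candidates hosting no same-group
// model; if every candidate conflicts, it returns the original slate so
// the scheduler can fall back to eviction on a conflicting node.
func preferNonConflicting(candidates []node, groups []string) []node {
	var clean []node
	for _, n := range candidates {
		if !hasConflict(n, groups) {
			clean = append(clean, n)
		}
	}
	if len(clean) == 0 {
		return candidates // fallback: eviction path
	}
	return clean
}

func main() {
	nodes := []node{
		{name: "node-a", hostedGroups: map[string]bool{"vram-heavy": true}},
		{name: "node-b", hostedGroups: map[string]bool{}},
	}
	picked := preferNonConflicting(nodes, []string{"vram-heavy"})
	fmt.Println(picked[0].name) // node-b preferred: no same-group model hosted
}
```

A `NodeSelector`-style eligibility filter would simply run before this step, narrowing `candidates` first.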
## Solution 2: Watchdog Mechanisms
For more flexible memory management, LocalAI provides watchdog mechanisms that automatically unload models based on their activity state. This allows multiple models to be loaded simultaneously, but automatically frees memory when models become inactive or stuck.