feat(distributed): declarative per-model scheduling via env/args (#10308)

* feat(distributed): add SpreadAll column and authoritative scheduling seeding

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): parse declarative model scheduling config (env/file)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): reconcile spread_all to one replica per matching node

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): wire LOCALAI_MODEL_SCHEDULING env/args and startup seeding

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): expose spread_all on the scheduling API endpoint

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): add spread-to-all-nodes mode to the scheduling UI

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): document LOCALAI_MODEL_SCHEDULING env/args

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): clarify replica modes and all-nodes spread in scheduling config

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
LocalAI [bot]
2026-06-13 18:31:06 +02:00
committed by GitHub
parent f0e001b7f8
commit 7637f8cf1b
13 changed files with 577 additions and 11 deletions

View File

@@ -604,6 +604,91 @@ All fields are optional and composable:
- Replicas only: auto-scale across all nodes
- Both: auto-scale on matching nodes only
### Declarative per-model scheduling (unattended installs)
In distributed mode you can declare per-model scheduling at startup, instead of
using the WebUI/API. Config is **authoritative**: it is re-applied on every boot
and overwrites the listed models (models not listed are left untouched).
| Variable | Description |
|----------|-------------|
| `LOCALAI_MODEL_SCHEDULING` | Inline JSON list of scheduling entries |
| `LOCALAI_MODEL_SCHEDULING_CONFIG` | Path to a YAML file with the same list |
Entry fields: `model_name` (required), `node_selector` (a label map; **omit it to
match every node**), and then **one of two replica modes** (they are mutually
exclusive):
- **`replicas: all`** - static spread: place exactly **one replica on every
matching node**, proactively, regardless of load, and keep it in sync as nodes
join and leave. Use this for "run model X everywhere (with this label)".
- **`min_replicas` / `max_replicas`** - elastic auto-scaling: keep at least
`min_replicas` running, and burst **up to** `max_replicas` only when all
replicas are busy, scaling back down to the minimum when idle. `max_replicas: 0`
means **no upper bound** (grow to cluster capacity). To enable this mode you
must set `min_replicas >= 1` or `max_replicas >= 1` - an entry with only
`max_replicas: 0` (and no `replicas: all`) does nothing.
Net effect at a glance:
| Config | Behavior |
|--------|----------|
| `replicas: all` | One replica per matching node, placed immediately, tracks join/leave |
| `min_replicas: 1, max_replicas: 0` | Always >=1, bursts to cluster capacity under load, back to 1 when idle |
| `min_replicas: 2, max_replicas: 4` | Always >=2, bursts to at most 4 under load |
`node_selector` constrains which nodes a model may use; with no selector the
model may use **all** healthy nodes. So "spread model X across all nodes" is just
`replicas: all` with no `node_selector`. `replicas: all` targets one replica per
matching node; with the default per-node cap of one replica per model this lands
exactly one on each node (see the note below about `LOCALAI_MAX_REPLICAS_PER_MODEL`).
YAML example (`scheduling.yaml`):
```yaml
# One replica on every GPU-labelled node (static spread, tracks join/leave):
- model_name: gpt-oss
node_selector:
tier: gpu
replicas: all
# One replica on EVERY node in the cluster (no selector = all nodes):
- model_name: embeddings
replicas: all
# Elastic on CPU nodes: always >=1, burst to capacity under load, 0 = no cap:
- model_name: whisper
node_selector:
tier: cpu
min_replicas: 1
max_replicas: 0
```
```bash
LOCALAI_DISTRIBUTED=true \
LOCALAI_MODEL_SCHEDULING_CONFIG=/etc/localai/scheduling.yaml \
local-ai run
```
Inline equivalent:
```bash
LOCALAI_MODEL_SCHEDULING='[{"model_name":"gpt-oss","node_selector":{"tier":"gpu"},"replicas":"all"}]'
```
Notes:
- Because the config is authoritative, each listed model's **entire** scheduling
row is replaced on every boot, including the optional prefix-cache routing
overrides (`route_policy`, `balance_abs_threshold`, `balance_rel_threshold`,
`min_prefix_match`). For a model you manage via this config, set those fields
here too if you need non-default values; values set only through the API are
reset on the next restart. Models not listed in the config are never touched.
- `replicas: all` places one replica per matching node by relying on the default
per-node cap of one replica per model. If you raise `LOCALAI_MAX_REPLICAS_PER_MODEL`
on a worker above 1, the target count can be met by stacking replicas on fewer
nodes rather than spreading one to each.
## Label Management API
| Method | Path | Description |