mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-14 11:49:33 -04:00
feat(distributed): declarative per-model scheduling via env/args (#10308)
* feat(distributed): add SpreadAll column and authoritative scheduling seeding Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): parse declarative model scheduling config (env/file) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): reconcile spread_all to one replica per matching node Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): wire LOCALAI_MODEL_SCHEDULING env/args and startup seeding Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): expose spread_all on the scheduling API endpoint Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): add spread-to-all-nodes mode to the scheduling UI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(distributed): document LOCALAI_MODEL_SCHEDULING env/args Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(distributed): clarify replica modes and all-nodes spread in scheduling config Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
@@ -604,6 +604,91 @@ All fields are optional and composable:
|
||||
- Replicas only: auto-scale across all nodes
|
||||
- Both: auto-scale on matching nodes only
|
||||
|
||||
### Declarative per-model scheduling (unattended installs)
|
||||
|
||||
In distributed mode you can declare per-model scheduling at startup, instead of
|
||||
using the WebUI/API. Config is **authoritative**: it is re-applied on every boot
|
||||
and overwrites the listed models (models not listed are left untouched).
|
||||
|
||||
| Variable | Description |
|
||||
|----------|-------------|
|
||||
| `LOCALAI_MODEL_SCHEDULING` | Inline JSON list of scheduling entries |
|
||||
| `LOCALAI_MODEL_SCHEDULING_CONFIG` | Path to a YAML file with the same list |
|
||||
|
||||
Entry fields: `model_name` (required), `node_selector` (a label map; **omit it to
|
||||
match every node**), and then **one of two replica modes** (they are mutually
|
||||
exclusive):
|
||||
|
||||
- **`replicas: all`** - static spread: place exactly **one replica on every
|
||||
matching node**, proactively, regardless of load, and keep it in sync as nodes
|
||||
join and leave. Use this for "run model X everywhere (with this label)".
|
||||
- **`min_replicas` / `max_replicas`** - elastic auto-scaling: keep at least
|
||||
`min_replicas` running, and burst **up to** `max_replicas` only when all
|
||||
replicas are busy, scaling back down to the minimum when idle. `max_replicas: 0`
|
||||
means **no upper bound** (grow to cluster capacity). To enable this mode you
|
||||
must set `min_replicas >= 1` or `max_replicas >= 1` - an entry with only
|
||||
`max_replicas: 0` (and no `replicas: all`) does nothing.
|
||||
|
||||
Net effect at a glance:
|
||||
|
||||
| Config | Behavior |
|
||||
|--------|----------|
|
||||
| `replicas: all` | One replica per matching node, placed immediately, tracks join/leave |
|
||||
| `min_replicas: 1, max_replicas: 0` | Always >=1, bursts to cluster capacity under load, back to 1 when idle |
|
||||
| `min_replicas: 2, max_replicas: 4` | Always >=2, bursts to at most 4 under load |
|
||||
|
||||
`node_selector` constrains which nodes a model may use; with no selector the
|
||||
model may use **all** healthy nodes. So "spread model X across all nodes" is just
|
||||
`replicas: all` with no `node_selector`. `replicas: all` targets one replica per
|
||||
matching node; with the default per-node cap of one replica per model this lands
|
||||
exactly one on each node (see the note below about `LOCALAI_MAX_REPLICAS_PER_MODEL`).
|
||||
|
||||
YAML example (`scheduling.yaml`):
|
||||
|
||||
```yaml
|
||||
# One replica on every GPU-labelled node (static spread, tracks join/leave):
|
||||
- model_name: gpt-oss
|
||||
node_selector:
|
||||
tier: gpu
|
||||
replicas: all
|
||||
|
||||
# One replica on EVERY node in the cluster (no selector = all nodes):
|
||||
- model_name: embeddings
|
||||
replicas: all
|
||||
|
||||
# Elastic on CPU nodes: always >=1, burst to capacity under load, 0 = no cap:
|
||||
- model_name: whisper
|
||||
node_selector:
|
||||
tier: cpu
|
||||
min_replicas: 1
|
||||
max_replicas: 0
|
||||
```
|
||||
|
||||
```bash
|
||||
LOCALAI_DISTRIBUTED=true \
|
||||
LOCALAI_MODEL_SCHEDULING_CONFIG=/etc/localai/scheduling.yaml \
|
||||
local-ai run
|
||||
```
|
||||
|
||||
Inline equivalent:
|
||||
|
||||
```bash
|
||||
LOCALAI_MODEL_SCHEDULING='[{"model_name":"gpt-oss","node_selector":{"tier":"gpu"},"replicas":"all"}]'
|
||||
```
|
||||
|
||||
Notes:
|
||||
|
||||
- Because the config is authoritative, each listed model's **entire** scheduling
|
||||
row is replaced on every boot, including the optional prefix-cache routing
|
||||
overrides (`route_policy`, `balance_abs_threshold`, `balance_rel_threshold`,
|
||||
`min_prefix_match`). For a model you manage via this config, set those fields
|
||||
here too if you need non-default values; values set only through the API are
|
||||
reset on the next restart. Models not listed in the config are never touched.
|
||||
- `replicas: all` places one replica per matching node by relying on the default
|
||||
per-node cap of one replica per model. If you raise `LOCALAI_MAX_REPLICAS_PER_MODEL`
|
||||
on a worker above 1, the target count can be met by stacking replicas on fewer
|
||||
nodes rather than spreading one to each.
|
||||
|
||||
## Label Management API
|
||||
|
||||
| Method | Path | Description |
|
||||
|
||||
Reference in New Issue
Block a user