From 6b63b47f61766c9a8a6a7052e03c14fb22da9084 Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Mon, 27 Apr 2026 21:20:05 +0200 Subject: [PATCH] feat(distributed): support multiple replicas of one model on the same node (#9583) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * feat(distributed): support multiple replicas of one model on the same node The distributed scheduler implicitly assumed `(node_id, model_name)` was unique, but the schema didn't enforce it and the worker keyed all gRPC processes by model name alone. With `MinReplicas=2` against a single worker, the reconciler "scaled up" every 30s but the registry never advanced past 1 row — the worker re-loaded the model in-place every tick until VRAM fragmented and the gRPC process died. This change introduces multi-replica-per-node as a first-class concept, with capacity-aware scheduling, a circuit breaker, and VRAM soft-reservation. Operators can declare per-node capacity via the worker flag `--max-replicas-per-model` (mirrored as auto-label `node.replica-slots=N`) or override per-node from the UI. * Schema: BackendNode gains MaxReplicasPerModel (default 1) and ReservedVRAM. NodeModel gains ReplicaIndex (composite with node_id + model_name). ModelSchedulingConfig gains UnsatisfiableUntil/Ticks for the reconciler circuit breaker. * Registry: replica_index threaded through SetNodeModel, RemoveNodeModel, IncrementInFlight, DecrementInFlight, TouchNodeModel, GetNodeModel, SetNodeModelLoadInfo and the InFlightTrackingClient. New helpers: CountReplicasOnNode, NextFreeReplicaIndex (with ErrNoFreeSlot), RemoveAllNodeModelReplicas, FindNodesWithFreeSlot, ClusterCapacityForModel, ReserveVRAM/ReleaseVRAM (atomic UPDATE with ErrInsufficientVRAM), and the unsatisfiable-flag CRUD. * Worker: processKey now `#` so concurrent loads of the same model land on distinct ports. Adds CLI flag --max-replicas-per-model (env LOCALAI_MAX_REPLICAS_PER_MODEL, default 1) and emits the auto-label. * Router: scheduleNewModel filters candidates by free slot, allocates the replica index, and soft-reserves VRAM before installing the backend. evictLRUAndFreeNode now deletes the targeted row by ID instead of all replicas of the model on the node — fixes a latent bug where evicting one replica orphaned its siblings. * Reconciler: caps scale-up at ClusterCapacityForModel so a misconfig (MinReplicas > capacity) doesn't loop forever. After 3 consecutive ticks of capacity==0 it sets UnsatisfiableUntil for a 5m cooldown and emits a warning. ClearAllUnsatisfiable fires from Register, ApproveNode, SetNodeLabel(s), RemoveNodeLabel and UpdateMaxReplicasPerModel so a new node joining or label changes wake the reconciler immediately. scaleDownIdle removes highest-replica-index first to keep slots compact. * Heartbeat resets reserved_vram to 0 — worker is the source of truth for actual free VRAM; the reservation is only for the in-tick race window between two scheduling decisions. * Probe path (reconciler.probeLoadedModels and health.doCheckAll) now pass the row's replica_index to RemoveNodeModel so an unreachable replica doesn't orphan healthy siblings. * Admin override: PUT /api/nodes/:id/max-replicas-per-model sets a sticky override (preserved across worker re-registration). DELETE clears the override so the worker's flag applies again on next register. Required because Kong defaults the worker flag to 1, so every worker restart would have silently reverted the UI value. * React UI: always-visible slot badge on the node row (muted at default 1, accented when >1); inline editor in the expanded drawer with pencil-to-edit, Save/Cancel, Esc/Enter, "(override)" indicator when the value is admin-set, and a "Reset" button to hand control back to the worker. Soft confirm when shrinking the cap below the count of loaded replicas. Scheduling rules table gets an "Unsatisfiable until HH:MM" status badge surfacing the cooldown. * node.replica-slots filtered out of the labels strip on the row to avoid duplicating the slot badge. 23 new Ginkgo specs (registry, reconciler, inflight, health) cover: multi-replica row independence, RemoveNodeModel of one replica preserving siblings, NextFreeReplicaIndex slot allocation including ErrNoFreeSlot, capacity-gated scale-up with circuit breaker tripping and recovery on Register, scheduleDownIdle ordering, ClusterCapacity math, ReserveVRAM admission gating, Heartbeat reset, override survival across worker re-registration, and ResetMaxReplicasPerModel handing control back. Plus 8 stdlib tests for the worker processKey / CLI / auto-label. Closes the flap reproduced on Qwen3.6-35B against the nvidia-thor worker (single 128 GiB node, MinReplicas=2): the reconciler now caps the scale-up at the cluster's actual capacity instead of looping. Signed-off-by: Ettore Di Giacinto Assisted-by: claude-code:opus-4-7 [Read] [Edit] [Bash] [Skill:critique] [Skill:audit] [Skill:polish] [Skill:golang-testing] * refactor(react-ui/nodes): tighten capacity editor copy + adopt ActionMenu for row actions * Capacity editor hint trimmed from operator-doc-style ("Sourced from the worker's `--max-replicas-per-model` flag. Changing it here makes it a sticky admin override that survives worker restarts." → "Saved values stick across worker restarts.") and the override-state copy similarly compressed. The full mechanic is no longer needed in the UI — the override pill carries the meaning and the docs cover the rest. * Node row actions migrated from an inline cluster of icon buttons (Drain / Resume / Trash) to the kebab ActionMenu used by /manage for per-row model actions, so dense Nodes tables stay clean. Approve stays as a prominent primary button — it's a stateful admission gate, not a routine action, and elevating it matches how /manage surfaces install-time decisions outside the menu. * The expanded drawer's Labels section now filters node.replica-slots out of the editable label list. The label is owned by the Capacity editor above; surfacing it again as an editable label invited confusion (the Capacity save would clobber any direct edit). Both backend and agent workers benefit — they share the row rendering path, so the action menu and label filter apply to both. Signed-off-by: Ettore Di Giacinto Assisted-by: claude-code:opus-4-7 [Edit] [chrome-devtools-mcp] [Skill:critique] [Skill:audit] [Skill:polish] * fix(react-ui/nodes): suppress slot badge on agent workers Agent workers don't load models, so the per-node replica capacity is inapplicable to them. Showing "1× slots" on agent rows was a tiny inconsistency from the unified rendering path — gate the badge on node_type !== 'agent' so it only appears on backend workers. Signed-off-by: Ettore Di Giacinto Assisted-by: claude-code:opus-4-7 [Edit] [chrome-devtools-mcp] * refactor(react-ui/nodes): distill expanded drawer + restyle scheduling form The expanded node drawer used to stack five panels — slot badge, filled capacity box, Loaded Models h4+empty-state, Installed Backends h4+empty-state, Labels h4+chips+form — making routine inspections feel like a control panel. The scheduling rule form wrapped its mode toggle as two 50%-width filled buttons that competed visually with the actual primary action. * Drawer: collapse three rarely-touched config zones (Capacity, Backends, Labels) into one `
` "Manage" disclosure (closed by default) with small uppercase eyebrow labels for each zone instead of parallel h4 sub-headings. Loaded Models stays as the at-a-glance headline with a single-line empty hint instead of a boxed empty state. CapacityEditor renders flat (no filled background) — the Manage disclosure provides framing. * Scheduling form: replace the chunky 50%-width button-tabs with the project's existing `.segmented` control (icon + label, sized to content). Mode hint becomes a single tied line below. Fields stack vertically with helper text under inputs and a hairline divider above the right-aligned Save / Cancel. The empty drawer collapses from ~5 stacked sections (~280px tall) to two lines (~80px). The scheduling form now reads as a designed dialog instead of raw building blocks. Both surfaces now match the typographic density and weight of the rest of the admin pages. Signed-off-by: Ettore Di Giacinto Assisted-by: claude-code:opus-4-7 [Edit] [chrome-devtools-mcp] [Skill:distill] [Skill:audit] [Skill:polish] * feat(react-ui/nodes): replace scheduling form's model picker with searchable combobox The native setDraft(e.target.value)} + onKeyDown={onKeyDown} + autoFocus + aria-describedby={`capacity-hint-${node.id}`} + style={{ + width: 72, padding: '4px 8px', borderRadius: 'var(--radius-sm)', + border: '1px solid var(--color-border)', background: 'var(--color-bg-primary)', + fontFamily: 'var(--font-mono)', fontSize: '0.8125rem', + color: 'var(--color-text-primary)', + }} + /> + + + + ) : ( + <> + + {current} + + {isOverride && ( + + override + + )} + + {isOverride && ( + + )} + + )} + +
+ {isOverride + ? <>Set from here. Reset to use the worker's default. + : <>Saved values stick across worker restarts.} +
+ + + ) +} + function gpuVendorLabel(vendor) { const labels = { nvidia: 'NVIDIA', @@ -148,91 +342,266 @@ function WorkerHintCard({ addToast, activeTab, hasWorkers }) { ) } +// Numeric input with quick-pick preset chips. Picked over a slider because +// replica counts are exact specs (operator math), not fuzzy estimates. The +// chips give one-click access to common values without the slider's +// precision/special-value problems (e.g. MaxReplicas=0 = "no limit"). +function ReplicaInput({ id, label, value, onChange, presets }) { + return ( +
+ + onChange(parseInt(e.target.value) || 0)} + /> +
+ {presets.map(({ v, l }) => { + const active = value === v + return ( + + ) + })} +
+
+ ) +} + +/** + * Controlled chip-builder for { key: value } maps. Replaces the prior + * comma-separated-string Node Selector input AND the bespoke Labels editor + * in the node drawer — both were rendering the same chip pattern with + * subtly different markup. + * + * Fully controlled: parent owns the map and decides what onAdd/onRemove + * does (form state for the scheduling form; API calls for the live + * labels editor). The component just renders chips and a key/value input + * row. + * + * Props: + * pairs — current map of key → value + * onAdd(k,v) — called when the user adds a pair (parent handles dedup + * and persistence side effects) + * onRemove(k) — called when a chip's × is clicked + * placeholderKey, placeholderValue — input hints + * ariaLabel — accessible name for the section + */ +function KeyValueChips({ pairs, onAdd, onRemove, placeholderKey = 'key', placeholderValue = 'value', ariaLabel }) { + const [k, setK] = useState('') + const [v, setV] = useState('') + + const add = () => { + const key = k.trim() + if (!key) return + onAdd(key, v.trim()) + setK(''); setV('') + } + const onKeyDown = (e) => { + if (e.key === 'Enter') { e.preventDefault(); add() } + } + + const entries = pairs ? Object.entries(pairs) : [] + return ( +
+ {entries.length > 0 && ( +
+ {entries.map(([key, val]) => ( + + {key}={val} + + + ))} +
+ )} +
+ setK(e.target.value)} + onKeyDown={onKeyDown} + style={{ flex: 1 }} + /> + setV(e.target.value)} + onKeyDown={onKeyDown} + style={{ flex: 1 }} + /> + +
+
+ ) +} + function SchedulingForm({ onSave, onCancel }) { const [mode, setMode] = useState('placement') const [modelName, setModelName] = useState('') - const [selectorText, setSelectorText] = useState('') + // Selector is now a chip-builder map instead of a comma-separated string. + // Operators were copying syntax from docs and missing commas; the chip UI + // makes the key=value structure self-documenting. + const [selector, setSelector] = useState({}) const [minReplicas, setMinReplicas] = useState(1) const [maxReplicas, setMaxReplicas] = useState(0) - const { models } = useModels() - const parseSelector = () => { - if (!selectorText.trim()) return null - const pairs = {} - selectorText.split(',').forEach(p => { - const [k, v] = p.split('=').map(s => s.trim()) - if (k) pairs[k] = v || '' - }) - return Object.keys(pairs).length > 0 ? pairs : null - } + const hasSelector = Object.keys(selector).length > 0 const isValid = () => { if (!modelName) return false - if (mode === 'placement') return !!parseSelector() + if (mode === 'placement') return hasSelector return minReplicas > 0 || maxReplicas > 0 } const handleSubmit = () => { - const nodeSelector = parseSelector() onSave({ model_name: modelName, - node_selector: nodeSelector || undefined, + node_selector: hasSelector ? selector : undefined, min_replicas: mode === 'placement' ? 0 : minReplicas, max_replicas: mode === 'placement' ? 0 : maxReplicas, }) } - const modeBtn = (value, label) => ( - - ) - return ( -
-
- {modeBtn('placement', 'Node Placement')} - {modeBtn('autoscaling', 'Auto-Scaling')} +
+ {/* Mode selector \u2014 uses the project's segmented control instead of two + 50%-width filled buttons that competed visually with the actual + primary action (Save). */} +
+