mirror of
https://github.com/mudler/LocalAI.git
synced 2026-05-30 19:47:47 -04:00
v4.3.1
29 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
a0f3e26245 |
fix(distributed): make admin backend installs resilient and observable (#9958)
* feat(distributed): add configurable NATS backend install/upgrade timeouts Adds BackendInstallTimeout and BackendUpgradeTimeout to DistributedConfig with 15m defaults, following the existing MCPToolTimeout / WorkerWaitTimeout pattern. These will replace the hardcoded literals in RemoteUnloaderAdapter so admin-driven backend installs across the cluster survive long OCI image pulls that previously timed out at 3m. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * style(distributed): gofmt alignment after timeout fields Re-aligns the Validate() negative-duration map and the Default* const block so the new BackendInstall/UpgradeTimeout entries do not leave the surrounding columns mis-padded. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(cli): surface LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT and _UPGRADE_TIMEOUT Parses the two new env vars on the run CLI and threads them through the existing AppOption builder so DistributedConfig picks them up. Invalid duration strings now fail loudly at startup rather than silently falling back to the default. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): inject NATS install/upgrade timeouts into RemoteUnloaderAdapter Removes the hardcoded 3m / 15m literals from RemoteUnloaderAdapter and threads in DistributedConfig.BackendInstallTimeoutOrDefault() and BackendUpgradeTimeoutOrDefault() at construction. Install now defaults to 15m (was 3m); cold OCI image pulls on Jetson Wi-Fi routinely blew past the old ceiling. Scripted messaging client captures the timeout so tests can assert the configured value actually reaches the NATS request. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): introduce galleryop.ErrWorkerStillInstalling sentinel When the NATS request-reply for backend.install (or .upgrade) times out the worker is almost always still pulling the OCI image. Wrap the timeout in a typed sentinel so the manager above can distinguish "worker hung" from "worker still working" and leave the pending_backend_ops row in place for the reconciler to confirm via backend.list. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): treat NATS install timeout as in-progress, not failure When a worker times out replying to backend.install but the install is still running on the worker, enqueueAndDrainBackendOp now reports a running_on_worker status and pushes NextRetryAt out by the install timeout so the reconciler does not immediately re-fire another install while the worker is still pulling the image. The pending_backend_ops row stays in place for the next reconciler pass to confirm via backend.list. InstallBackend wraps the result in galleryop.ErrWorkerStillInstalling so callers can branch (galleryop renders yellow in-progress instead of red error). UpgradeBackend uses the same wrap. Adds RemoteUnloaderAdapter.InstallTimeout() so the manager can push NextRetryAt by the configured timeout without reaching into a private field, and NodeRegistry.RecordPendingBackendOpInFlight as the soft cousin of RecordPendingBackendOpFailure. Also includes incidental gofmt-driven struct-field alignment in registry.go on lines unrelated to the change (touched files are re-formatted to canonical form per project policy). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(distributed): don't increment Attempts on in-flight install timeout An in-flight timeout (worker still pulling the OCI image) is not a failed attempt, it's a delayed one. Incrementing Attempts let genuinely-progressing slow installs (e.g. 30 GB CUDA images on Wi-Fi) trip the reconciler's maxPendingBackendOpAttempts cap and dead-letter the queue row while the worker was still legitimately working. RecordPendingBackendOpInFlight now only updates LastError and NextRetryAt. Also documents "running_on_worker" in the NodeOpStatus.Status enum comment so Task 6 implementers see the full surface. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(galleryop): surface ErrWorkerStillInstalling as non-error OpStatus When the distributed backend manager returns an error that wraps ErrWorkerStillInstalling, backendHandler now completes the op with a "still installing in background" message rather than marking it as a red failure. Admin UI sees a yellow in-progress state; reconciler confirms completion on its next pass. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(distributed): end-to-end install-timeout-then-reconcile Wires Task 1-6 end-to-end so any seam mismatch surfaces in CI rather than during a real cluster install. NATS times out, the queue row stays alive with running_on_worker status, the worker eventually reports the backend installed via backend.list, the manager surfaces it via ListBackends. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(distributed): document LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT / _UPGRADE_TIMEOUT Add the two new operator-tunable env vars to the Frontend Configuration table in the distributed-mode docs. Explains the 15m default, when to raise it (slow links pulling multi-GB OCI images), and the new "still installing in background" admin-UI state when the round-trip times out but the worker is still working. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): clear pending install rows when backend.list confirms DistributedBackendManager.ListBackends now proactively clears pending_backend_ops install rows whose (nodeID, backend) is reported installed by backend.list. Operator UI updates immediately instead of waiting up to installTimeout (default 15m) for the next reconciler tick after NextRetryAt. Only install rows are cleared; upgrade and delete intents are not satisfied by presence in backend.list and continue to drain through their normal reconciler paths. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(messaging): add BackendInstallProgressEvent wire type and subject New NATS subject nodes.<nodeID>.backend.install.<opID>.progress lets the worker publish transient progress events (file, current/total bytes, percentage, phase) while a long-running install pulls its OCI image. BackendInstallRequest gains an optional OpID field so the worker knows which subject to publish on. Transient pub/sub (not JetStream): the install reply remains ground truth for success/failure; dropped progress events are tolerable. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * style(messaging): drop em-dash from BackendInstallProgress test comment Per project convention (no em-dashes anywhere). Comment substance is unchanged. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): worker publishes debounced install progress over NATS When BackendInstallRequest.OpID is set, the worker's backend.install handler wires a debounced publisher (250ms window) into the gallery download callback. Each tick becomes a BackendInstallProgressEvent on nodes.<nodeID>.backend.install.<opID>.progress; the publisher always emits a final event on Flush so the UI sees the terminal percentage. Old masters that do not set OpID continue to run silent installs: no behavior change for them. Lock ordering: the publisher releases its mutex before calling messaging.Publish so a slow network never stalls the install loop. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): RemoteUnloaderAdapter subscribes to install progress InstallBackend gains opID + onProgress parameters. When both are set, the adapter subscribes to nodes.<nodeID>.backend.install.<opID>.progress BEFORE publishing the install request, decodes each message into the caller's onProgress callback in a goroutine (so a slow callback never stalls the NATS reader thread), and unsubscribes after RequestJSON returns. When onProgress is nil OR opID is empty (the reconciler retry path), subscription is skipped entirely - silent installs cost nothing extra. Subscribe failure is logged at Warn and the install proceeds without progress streaming; the NATS round-trip still owns terminal status. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): forward backend install progress into galleryop OpStatus DistributedBackendManager.InstallBackend now passes the gallery op ID and a progress bridge into the adapter call. Each BackendInstallProgressEvent from the worker becomes a galleryop.ProgressCallback tick - which the existing backendHandler already turns into OpStatus.UpdateStatus, so the admin UI/SSE polling sees per-byte progress for distributed installs without any UI-side change. UpgradeBackend is intentionally left silent for now: its wire request (BackendUpgradeRequest) does not carry OpID, and rolling-update fallback is the rarer path. Will be picked up in a follow-up if the worker upgrade path also gets a progress channel. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(distributed): InstallBackend tolerates silent (pre-Phase-2) workers A worker on pre-Phase-2 code never publishes progress events. The new master subscribes optimistically; this spec pins that a silent worker still produces a green install with no progressCb ticks. The install reply is the source of truth for terminal state; the progress stream is a best-effort UX enrichment. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(distributed): document install progress streaming Note the new nodes.<nodeID>.backend.install.<opID>.progress subject and the silent-worker compatibility behavior so operators know to expect real-time progress and what happens on a mixed-version cluster. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(distributed): note progress-event ordering trade-off in InstallBackend Document near the goroutine dispatch why ordering at the consumer is best-effort, why it rarely matters in practice (worker debounce >> goroutine jitter), and what a future hardening pass would look like (Seq field + stale-by-seq drop). Stops the next reader from accidentally "fixing" the goroutine pool away. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(galleryop): add NodeProgress + OpStatus.Nodes for per-node breakdown Adds the data model the UI needs to render an expandable per-node breakdown of a fanned-out backend install. NodeProgress carries node identity (ID + name), per-node status (queued / running_on_worker / success / error / downloading), the current file + bytes + percentage from the Phase 2 progress stream, and any per-node error. OpStatus.Nodes is the slice the /api/operations handler will surface in a follow-up. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(galleryop): UpdateNodeProgress merges per-node ticks by NodeID GalleryService.UpdateNodeProgress(opID, nodeID, np) merges a NodeProgress into OpStatus.Nodes (keyed by NodeID, no duplicates) and mirrors the latest tick into the aggregate Progress / FileName / DownloadedFileSize / TotalFileSize fields so the legacy single-bar OperationsBar view keeps working unchanged alongside the new per-node breakdown. Concurrent-safe via the existing g.Mutex. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): write per-node OpStatus entries during install fan-out DistributedBackendManager now accepts a nodeProgressSink and feeds it two streams: 1. enqueueAndDrainBackendOp emits a per-node terminal entry on each status it appends to BackendOpResult (queued, success, error, running_on_worker). The opID is threaded through the function so the sink gets the right gallery op identity. 2. The install apply closure fans each BackendInstallProgressEvent into the sink as a downloading entry, alongside the legacy progressCb path so the aggregate single-bar view stays correct. Production wiring passes the GalleryService (which implements UpdateNodeProgress via Task 2) as the sink. Single-node tests pass nil. DeleteBackend and UpgradeBackend pass an empty opID so the sink path no-ops for ops that aren't gallery-tracked the same way as Install. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(operations): expose per-node breakdown on /api/operations When an operation's OpStatus has Nodes entries (populated by the Phase 4 progress sink wiring), surface them as a "nodes" array on the /api/operations response, sorted by node_name for stable rendering. Backward compatible: legacy clients ignore the field; ops without any node entries (single-node mode, model installs) omit the array entirely thanks to the empty-slice guard. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(ui): per-node breakdown in OperationsBar When an install op fans out to more than one worker, the operations bar now shows a "N nodes" chevron that expands into a per-node list. Each row carries the node's status (color-coded pill), the current file being downloaded, byte counts, percentage, and a thin per-node progress bar. Yellow "Worker busy" pill marks running_on_worker status with a tooltip explaining the NATS round-trip timed out but the worker is still installing in the background. Backward compatible: ops without a nodes field (legacy or single-node mode) render as before. State for expand/collapse is local to the component, keyed by jobID/id - reload starts collapsed. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(distributed): document per-node breakdown in the operations bar Adds a short subsection covering the expandable "N nodes" chevron in the OperationsBar admin UI, the meaning of each status pill, and how it relates to the /api/operations nodes array. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(galleryop): UpdateStatus preserves Nodes when caller sends none Real-world bug surfaced by the Phase 4 multi-worker smoke test: the nodes[] array in /api/operations flickered between a single node at a time on a 2-worker install. Root cause: the Phase 2 progress bridge also calls the legacy progressCb -> UpdateStatus(&OpStatus{...}) on every tick. UpdateStatus then overwrote the entire status pointer, wiping the Nodes slice that UpdateNodeProgress had just merged in. Fix: in UpdateStatus, if the incoming op has an empty Nodes slice, carry forward the previous status's Nodes before storing. Callers that explicitly populate Nodes still win (their slice replaces the prior one, no merge across the two code paths). Two regression specs added pinning both directions of the contract. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(distributed): strip implementation details from user-facing docs Trim the new install/upgrade timeout rows and the install-progress sections to focus on what the operator sees and tunes. Drops: - the NATS subject names and pub/sub mechanics - "round-trip" / reconciler / backend.list jargon - /api/operations polling cadence - "pre-2026-05-22" version references Reframes the breakdown text around the admin UI (Operations Bar, chevron, status pills, "Worker busy" tooltip). Implementation context lives in the agent notes and code comments. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(config): move DistributedConfig.Validate flag names to constants The negative-duration check map was a wall of literal kebab-case strings that had to stay in sync with the kong-derived CLI flag names manually. Move them to a Flag* const block alongside the existing Default* block so a rename of either the Go field or the CLI naming convention forces a compile error rather than silent drift. Sole consumer today is Validate; the constants are exported so future operator-facing surfaces (e.g. error messages on other validation paths) can reference them by name instead of repeating the literals. Tests pin both the literal values (so a future "let's just rename this" doesn't accidentally regress the CLI flag) and the negative- duration error message for the new BackendInstall / BackendUpgrade fields. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(distributed): extract NodeStatus and Phase enums to constants Sweep for the same literal-string-as-identifier pattern called out on the Validate flag names: the per-node install status enum ("queued" | "downloading" | "running_on_worker" | "success" | "error") appeared as raw literals across managers_distributed.go (10+ sites, including 3 separate `n.Status == "running_on_worker"` checks), operation.go, and the test suite. Same shape for the Phase enum ("resolving" | "downloading" | "extracting" | "starting") in the worker-side progress publisher. Promote both to exported const blocks: - galleryop.NodeStatus{Queued,Downloading,RunningOnWorker,Success,Error} shared between galleryop.NodeProgress.Status (the wire field) and nodes.NodeOpStatus.Status (the in-process per-node summary) - messaging.Phase{Resolving,Downloading,Extracting,Starting} shared between the worker publisher and any future consumer that needs to switch on phase Tests pin both the literal values (so a future "let's just rename" doesn't silently change the JSON wire) and use the constants in setup (so the producer side stays drift-protected). Wire-format assertions on the /api/operations JSON output keep their literals deliberately, so the constant value can never silently diverge from what the UI receives. Out of scope for this PR (separate cleanup): the finetune and quantization job-status enums have the same anti-pattern with 14+ literal sites each, but predate this PR's work. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
e5d7b84216 |
fix(distributed): split NATS backend.upgrade off install + dedup loads (#9717)
* feat(messaging): add backend.upgrade NATS subject + payload types
Splits the slow force-reinstall path off backend.install so it can run on
its own subscription goroutine, eliminating head-of-line blocking between
routine model loads and full gallery upgrades.
Wire-level Force flag on BackendInstallRequest is kept for one release as
the rolling-update fallback target; doc note marks it deprecated.
Assisted-by: Claude:claude-sonnet-4-6
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed/worker): add per-backend mutex helper to backendSupervisor
Different backend names lock independently; same backend serializes. This
is the synchronization primitive used by the upcoming concurrent install
handler — without it, wrapping the NATS callback in a goroutine would
race the gallery directory when two requests target the same backend.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(distributed/worker): run backend.install handler in a goroutine
NATS subscriptions deliver messages serially on a single per-subscription
goroutine. With a synchronous install handler, a multi-minute gallery
download would head-of-line-block every other install request to the
same worker — manifesting upstream as a 5-minute "nats: timeout" on
unrelated routine model loads.
The body now runs in its own goroutine, with a per-backend mutex
(lockBackend) protecting the gallery directory from concurrent operations
on the same backend. Different backend names install in parallel.
Backward-compat: req.Force=true is still honored here, so an older master
that hasn't been updated to send on backend.upgrade keeps working.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed/worker): subscribe to backend.upgrade as a separate path
Slow force-reinstall now lives on its own NATS subscription, so a
multi-minute gallery pull cannot head-of-line-block the routine
backend.install handler on the same worker. Same per-backend mutex
guards both — concurrent install + upgrade for the same backend
serialize at the gallery directory; different backends are independent.
upgradeBackend stops every live process for the backend, force-installs
from gallery, and re-registers. It does not start a new process — the
next backend.install will spawn one with the freshly-pulled binary.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed): add UpgradeBackend on NodeCommandSender; drop Force from InstallBackend
Master now sends to backend.upgrade for force-reinstall, with a
nats.ErrNoResponders fallback to the legacy backend.install Force=true
path so a rolling update with a new master + an old worker still
converges. The Force parameter leaves the public Go API surface
entirely — only the internal fallback sets it on the wire.
InstallBackend timeout drops 5min -> 3min (most replies are sub-second
since the worker short-circuits on already-running or already-installed).
UpgradeBackend timeout is 15min, sized for real-world Jetson-on-WiFi
gallery pulls.
Updates the admin install HTTP endpoint
(core/http/endpoints/localai/nodes.go) to the new signature too.
router_test.go's fakeUnloader does not yet implement the new interface
shape; Task 3.2 will catch it up before the next package-level test run.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(distributed): update fakeUnloader for new NodeCommandSender shape
InstallBackend lost its force bool param (Force is not part of the public
Go API anymore — only the internal upgrade-fallback path sets it on the
wire). UpgradeBackend gained a method. Fake records both call slices and
provides an installHook concurrency seam for upcoming singleflight tests.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(distributed): cover UpgradeBackend's new subject + rolling-update fallback
Task 3.1 changed the master to publish UpgradeBackend on the new
backend.upgrade subject; the existing UpgradeBackend tests scripted the
old install subject and so all 3 began failing as expected. Updates them
to script SubjectNodeBackendUpgrade with BackendUpgradeReply.
Adds two new specs for the rolling-update fallback:
- ErrNoResponders on backend.upgrade triggers a backend.install
Force=true retry on the same node.
- Non-NoResponders errors propagate to the caller unchanged.
scriptedMessagingClient gains scriptNoResponders (real nats sentinel) and
scriptReplyMatching (predicate-matched canned reply, used to assert that
the fallback path actually sets Force=true on the install retry).
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(distributed): coalesce concurrent identical backend.install via singleflight
Six simultaneous chat completions for the same not-yet-loaded model were
observed firing six independent NATS install requests, each serializing
through the worker's per-subscription goroutine and amplifying queue
depth. SmartRouter now wraps the NATS round-trip in a singleflight.Group
keyed by (nodeID, backend, modelID, replica): N concurrent identical
loads share one round-trip and one reply.
Distinct (modelID, replica) keys still fire independent calls, so
multi-replica scaling and multi-model fan-out are unaffected.
fakeUnloader gains a sync.Mutex around its recording slices to keep
concurrent test goroutines race-clean.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(e2e/distributed): drop force arg from InstallBackend test calls
Two e2e test call sites still passed the trailing force bool that was
removed from RemoteUnloaderAdapter.InstallBackend in
|
||
|
|
2be07f61da |
feat(whisper): honor client cancellation via ggml abort_callback (#9710)
* refactor(transcription): propagate request ctx through ModelTranscription* Replaces context.Background() with the HTTP request ctx so client disconnects start cancelling the gRPC call. No backend-side abort wiring yet — that comes in a later commit. Pure plumbing. Assisted-by: Claude:claude-haiku-4-5 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(cli): pass ctx to backend.ModelTranscription Follow-up to |
||
|
|
447c186089 |
fix(distributed): make backend upgrade actually re-install on workers (#9708)
* fix(distributed): make backend upgrade actually re-install on workers UpgradeBackend dispatched a vanilla backend.install NATS event to every node hosting the backend. The worker's installBackend short-circuits on "already running for this (model, replica) slot" and returns the existing address — so the gallery install path was skipped, no artifact was re-downloaded, no metadata was written. The frontend's drift detection then re-flagged the same backends every cycle (installedDigest stays empty → mismatch → "Backend upgrade available (new build)") while "Backend upgraded successfully" landed in the logs at the same time. The user-visible symptom: clicking "Upgrade All" silently does nothing and the same N backends sit on the upgrade list forever. Two coupled fixes, one PR: 1. Force flag on backend.install. Add `Force bool` to BackendInstallRequest and thread it through NodeCommandSender -> RemoteUnloaderAdapter. UpgradeBackend (and the reconciler's pending-op drain when retrying an upgrade) sets force=true; routine load events and admin install endpoints keep force=false. On the worker, force=true stops every live process that uses this backend (resolveProcessKeys for peer replicas, plus the exact request processKey), skips the findBackend short-circuit, and passes force=true into gallery.InstallBackendFromGallery so the on-disk artifact is overwritten. After the gallery install completes, startBackend brings up a fresh process at the same processKey on a new port. 2. Liveness check on the fast path. installBackend's "already running" branch read getAddr without verifying the process was alive, so a gRPC backend that died without the supervisor noticing left a stale (key, addr) entry. The reconciler then dialed that address, got ECONNREFUSED, marked the replica failed, retried install — and the supervisor said "already running addr=…" again. Loop forever, exactly what we observed on a node whose llama-cpp process had died but whose supervisor record persisted. Verify s.isRunning(processKey) before trusting getAddr; if the entry is stale, stopBackendExact cleans up and we fall through to a real install. Backwards-compatible: the new Force field is omitempty, older workers ignore it (their default behavior matches force=false). The signature change on NodeCommandSender.InstallBackend is internal-only. Verified: unit tests in core/services/nodes pass (108s suite). The pre-existing core/backend build break (proto regen pending for word-level timestamps) blocks core/cli and core/http/endpoints/localai package tests but is unrelated to this change. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code] * test(e2e/distributed): pass force=false to adapter.InstallBackend NodeCommandSender.InstallBackend gained a final force bool in the upgrade-force commit; the e2e distributed lifecycle tests still called the old 8-arg signature and broke compilation. These tests exercise the routine install path (single replica, default behavior), so force=false preserves their existing semantics. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
70cf8ac546 |
fix(backend): resolve relative draft_model paths against the models dir (#9680)
* fix(backend): resolve relative draft_model paths against the models dir The main model file and mmproj are joined with the configured models directory before reaching the backend, but draft_model was sent verbatim. With a relative draft_model in the YAML config, llama.cpp opens the path from the backend process's CWD and fails with "No such file or directory", forcing users to hard-code an absolute path. Mirror the existing mmproj resolution: if draft_model is relative, join it with modelPath. Absolute paths are passed through unchanged. Adds an e2e regression test against the mock backend that asserts the main model file, mmproj, and draft_model all arrive at the backend resolved to absolute paths. Closes #9675 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7-1m [Read] [Edit] [Bash] [Write] * fix(backend): always join draft_model with models dir (drop IsAbs shortcut) The previous commit kept absolute draft_model paths intact via an IsAbs check. That left a path-traversal vector open: a user-supplied YAML config could set draft_model to /etc/passwd (or any other host file the backend process can read) and the path would be sent through unchanged. filepath.Join cleans the leading slash from absolute components, so joining unconditionally — the way mmproj already does — keeps the result rooted at the configured models directory regardless of input. Adds a second e2e spec that feeds an absolute draft_model into the mock backend and asserts the path is clamped under modelsPath. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7-1m [Read] [Edit] [Bash] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
8e43842175 |
feat(vllm, distributed): tensor parallel distributed workers (#9612)
* feat(vllm): build vllm from source for Intel XPU
Upstream publishes no XPU wheels for vllm. The Intel profile was
silently picking up a non-XPU wheel that imported but errored at
engine init, and several runtime deps (pillow, charset-normalizer,
chardet) were missing on Intel -- backend.py crashed at import time
before the gRPC server came up.
Switch the Intel profile to upstream's documented from-source
procedure (docs/getting_started/installation/gpu.xpu.inc.md in
vllm-project/vllm):
- Bump portable Python to 3.12 -- vllm-xpu-kernels ships only a
cp312 wheel.
- Source /opt/intel/oneapi/setvars.sh so vllm's CMake build sees
the dpcpp/sycl compiler from the oneapi-basekit base image.
- Hide requirements-intel-after.txt during installRequirements
(it used to 'pip install vllm'); install vllm's deps from a
fresh git clone of vllm via 'uv pip install -r
requirements/xpu.txt', swap stock triton for
triton-xpu==3.7.0, then 'VLLM_TARGET_DEVICE=xpu uv pip install
--no-deps .'.
- requirements-intel.txt trimmed to LocalAI's direct deps
(accelerate / transformers / bitsandbytes); torch-xpu, vllm,
vllm_xpu_kernels and the rest come from upstream's xpu.txt
during the source build.
- requirements.txt: add pillow + charset-normalizer + chardet --
used by backend.py and missing on the Intel install profile.
- run.sh: 'set -x' so backend startup is visible in container
logs (the gRPC startup error path was previously opaque).
Also adds a one-line docs example for engine_args.attention_backend
under the vLLM section, since older XE-HPG GPUs (e.g. Arc A770)
need TRITON_ATTN to bypass the cutlass path in vllm_xpu_kernels.
Tested end-to-end on an Intel Arc A770 with Qwen2.5-0.5B-Instruct
via LocalAI's /v1/chat/completions.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(vllm): add multi-node data-parallel follower worker
vLLM v1's multi-node story is one process per node sharing a DP
coordinator over ZMQ -- the head runs the API server with
data_parallel_size > 1 and followers run `vllm serve --headless ...`
with matching topology. Today LocalAI can already configure DP on the
head via the engine_args YAML map, but there's no way to bring up the
follower nodes -- so the head sits waiting for ranks that never
handshake.
Add `local-ai p2p-worker vllm`, mirroring MLXDistributed's structural
precedent (operator-launched, static config, no NATS placement). The
worker:
- Optionally self-registers with the frontend as an agent-type node
tagged `node.role=vllm-follower` so it's visible in the admin UI
and operators can scope ordinary models away via inverse
selectors.
- Resolves the platform-specific vllm backend via the gallery's
"vllm" meta-entry (cuda*, intel-vllm, rocm-vllm, ...).
- Runs vLLM as a child process so the heartbeat goroutine survives
until vLLM exits; forwards SIGINT/SIGTERM so vLLM can clean up its
ZMQ sockets before we tear down.
- Validates --headless + --start-rank 0 is rejected (rank 0 is the
head and must serve the API).
Backend run.sh dispatches `serve` as the first arg to vllm's own CLI
instead of LocalAI's backend.py gRPC server -- the follower speaks
ZMQ directly to the head, there is no LocalAI gRPC on the follower
side. Single-node usage is unchanged.
Generalises the gallery resolution helper into findBackendPath()
shared by MLX and vLLM workers; extracts ParseNodeLabels for the
comma-separated label parsing both use.
Ships with two compose recipes (`docker-compose.vllm-multinode.yaml`
for NVIDIA, `docker-compose.vllm-multinode.intel.yaml` for Intel
XPU/xccl) plus `tests/e2e/vllm-multinode/smoke.sh`. Both vendors are
supported (NCCL for CUDA/ROCm, xccl for XPU) but mixed-vendor DP is
not -- PyTorch's process group requires every rank to use the same
collective backend, and NCCL/xccl/gloo don't interoperate.
Out of scope (deferred): SmartRouter-driven placement of follower
ranks via NATS backend.install events, follower log streaming through
/api/backend-logs, tensor-parallel across nodes, disaggregated
prefill via KVTransferConfig.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* test(vllm): CPU-only end-to-end test for multi-node DP
Adds tests/e2e/vllm-multinode/, a Ginkgo + testcontainers-go suite
that brings up a head + headless follower from the locally-built
local-ai:tests image, bind-mounts the cpu-vllm backend extracted by
make extract-backend-vllm so it's seen as a system backend (no gallery
fetch, no registry server), and asserts a chat completion across both
DP ranks. New `make test-e2e-vllm-multinode` target wires the docker
build, backend extract, and ginkgo run together; BuildKit caches both
images so re-runs only rebuild what changed. Tagged Label("VLLMMultinode")
so the existing distributed suite isn't pulled along.
Two pre-existing bugs surfaced by the test:
1. extract-backend-% (Makefile) failed for every backend, because all
backend images end with `FROM scratch` and `docker create` rejects
an image with no CMD/ENTRYPOINT. Fixed by passing
--entrypoint=/run.sh -- the container is never started, only
docker-cp'd, so the path doesn't have to exist; we just need
anything that satisfies the daemon's create-time validation.
2. backend/python/vllm/run.sh's `serve` shortcut for the multi-node DP
follower exec'd ${EDIR}/venv/bin/vllm directly, but uv bakes an
absolute build-time shebang (`#!/vllm/venv/bin/python3`) that no
longer resolves once the backend is relocated to BackendsPath.
_makeVenvPortable's shebang rewriter only matches paths that
already point at ${EDIR}, so the original shebang slips through
unchanged. Fixed by exec-ing ${EDIR}/venv/bin/python with the script
as an argument -- Python ignores the script's shebang in that case.
The test fixture caps memory aggressively (max_model_len=512,
VLLM_CPU_KVCACHE_SPACE=1, TORCH_COMPILE_DISABLE=1) so two CPU engines
fit on a 32 GB box. TORCH_COMPILE_DISABLE is currently mandatory for
cpu-vllm: torch._inductor's CPU-ISA probe runs even with
enforce_eager=True and needs g++ on PATH, which the LocalAI runtime
image doesn't ship -- to be addressed in a follow-up that bundles a
toolchain in the cpu-vllm backend.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(vllm): bundle a g++ toolchain in the cpu-vllm backend image
torch._inductor's CPU-ISA probe (`cpu_model_runner.py:65 "Warming up
model for the compilation"`) shells out to `g++` at vllm engine
startup, regardless of `enforce_eager=True` -- the eager flag only
disables CUDA graphs, not inductor's first-batch warmup. The LocalAI
CPU runtime image (Dockerfile, unconditional apt list) does not ship
build-essential, and the cpu-vllm backend image is `FROM scratch`,
so any non-trivial inference on cpu-vllm crashes with:
torch._inductor.exc.InductorError:
InvalidCxxCompiler: No working C++ compiler found in
torch._inductor.config.cpp.cxx: (None, 'g++')
Bundling the toolchain in the CPU runtime image would bloat every
non-vllm-CPU deployment and force a single GCC version on backends
that may want clang or a different version. So this lives in the
backend, gated to BUILD_TYPE=='' (the CPU profile).
`package.sh` snapshots g++ + binutils + cc1plus + libstdc++ + libc6
(runtime + dev) + the math libs cc1plus links (libisl/libmpc/libmpfr/
libjansson) into ${BACKEND}/toolchain/, mirroring /usr/... layout. The
unversioned binaries on Debian/Ubuntu are symlink chains pointing into
multiarch packages (`g++` -> `g++-13` -> `x86_64-linux-gnu-g++-13`,
the latter in `g++-13-x86-64-linux-gnu`), so the package list resolves
both the version and the arch-triplet variant. Symlinks /lib ->
usr/lib and /lib64 -> usr/lib64 are recreated under the toolchain
root because Ubuntu's UsrMerge keeps them at /, and ld scripts
(`libc.so`, `libm.so`) hardcode `/lib/...` paths that --sysroot
re-roots into the toolchain.
The unversioned `g++`/`gcc`/`cpp` symlinks are replaced with wrapper
shell scripts that resolve their own location at runtime and pass
`--sysroot=<toolchain>` and `-B <toolchain>/usr/lib/gcc/<triplet>/<ver>/`
to the underlying versioned binary. That's how torch's bare `g++ foo.cpp
-o foo` invocation finds cc1plus (-B), system headers (--sysroot), and
the bundled libstdc++ (--sysroot, --sysroot is recursive into linker).
`run.sh` adds the toolchain bin dir to PATH and the toolchain's
shared-lib dir to LD_LIBRARY_PATH -- everything else (header search,
linker search, executable search) is encapsulated in the wrappers.
No-op for non-CPU builds, the dir doesn't exist there.
The cpu-vllm image grows by ~217 MB. Tradeoff is acceptable -- cpu-vllm
is already a niche profile (few users compared to GPU vllm) and the
alternative is a backend that crashes at first inference unless the
operator manually sets TORCH_COMPILE_DISABLE=1, which silently disables
all torch.compile optimizations.
Drops `TORCH_COMPILE_DISABLE=1` from tests/e2e/vllm-multinode -- the
smoke now exercises the real compile path through the bundled toolchain.
Test runtime is +20s for the warmup compile, still <90s end to end.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* fix(vllm): scope jetson-ai-lab index to L4T-specific wheels via pyproject.toml
The L4T arm64 build resolves dependencies through pypi.jetson-ai-lab.io,
which hosts the L4T-specific torch / vllm / flash-attn wheels but also
transparently proxies the rest of PyPI through `/+f/<sha>/<filename>`
URLs. With `--extra-index-url` + `--index-strategy=unsafe-best-match`
uv would pick those proxy URLs for ordinary PyPI packages —
anthropic/openai/propcache/annotated-types — and fail when the proxy
503s. Master is hitting the same bug on its own l4t-vllm matrix entry.
Switch the l4t13 install path to a pyproject.toml that marks the
jetson-ai-lab index `explicit = true` and pins only torch, torchvision,
torchaudio, flash-attn, and vllm to it via [tool.uv.sources]. uv won't
consult the L4T mirror for anything else, so transitive deps fall back
to PyPI as the default index — no exposure to the proxy 503s.
`uv pip install -r requirements.txt` ignores [tool.uv.sources], so the
l4t13 branch in install.sh now invokes `uv pip install --requirement
pyproject.toml` directly, replacing the old requirements-l4t13*.txt
files. Other BUILD_PROFILEs continue using libbackend.sh's
installRequirements and never read pyproject.toml.
Local resolution test (x86_64, dry-run) confirms uv hits the L4T
index for torch and falls through to PyPI for everything else.
Assisted-by: claude-code:claude-opus-4-7-1m [Read] [Edit] [Bash] [Write]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
|
||
|
|
e86ade54a6 |
feat(api): add /v1/audio/diarization endpoint with sherpa-onnx + vibevoice.cpp (#9654)
* feat(api): add /v1/audio/diarization endpoint with sherpa-onnx + vibevoice.cpp
Closes #1648.
OpenAI-style multipart endpoint that returns "who spoke when". Single
endpoint instead of the issue's three-endpoint sketch (refactor /vad,
/vad/embedding, /diarization) — the typical client wants one call, and
embeddings can land later as a sibling without breaking this surface.
Response shape borrows from Pyannote/Deepgram: segments carry a
normalised SPEAKER_NN id (zero-padded, stable across the response) plus
the raw backend label, optional per-segment text when the backend bundles
ASR, and a speakers summary in verbose_json. response_format also accepts
rttm so consumers can pipe straight into pyannote.metrics / dscore.
Backends:
* vibevoice-cpp — Diarize() reuses the existing vv_capi_asr pass.
vibevoice's ASR prompt asks the model to emit
[{Start,End,Speaker,Content}] natively, so diarization is a by-product
of the same pass; include_text=true preserves the transcript per
segment, otherwise we drop it.
* sherpa-onnx — wraps the upstream SherpaOnnxOfflineSpeakerDiarization
C API (pyannote segmentation + speaker-embedding extractor + fast
clustering). libsherpa-shim grew config builders, a SetClustering
wrapper for per-call num_clusters/threshold overrides, and a
segment_at accessor (purego can't read field arrays out of
SherpaOnnxOfflineSpeakerDiarizationSegment[] directly).
Plumbing: new Diarize gRPC RPC + DiarizeRequest / DiarizeSegment /
DiarizeResponse messages, threaded through interface.go, base, server,
client, embed. Default Base impl returns unimplemented.
Capability surfaces all updated: FLAG_DIARIZATION usecase,
FeatureAudioDiarization permission (default-on), RouteFeatureRegistry
entries for /v1/audio/diarization and /audio/diarization, audio
instruction-def description widened, CAP_DIARIZATION JS symbol,
swagger regenerated, /api/instructions discovery map updated.
Tests:
* core/backend: speaker-label normalisation (first-seen → SPEAKER_NN,
per-speaker totals, nil-safety, fallback to backend NumSpeakers when
no segments).
* core/http/endpoints/openai: RTTM rendering (file-id basename, negative
duration clamping, fallback id).
* tests/e2e: mock-backend grew a deterministic Diarize that emits
raw labels "5","2","5" so the e2e suite verifies SPEAKER_NN
remapping, verbose_json speakers summary + transcript pass-through
(gated by include_text), RTTM bytes content-type, and rejection of
unknown response_format. mock-diarize model config registered with
known_usecases=[FLAG_DIARIZATION] to bypass the backend-name guard.
Docs: new features/audio-diarization.md (request/response, RTTM example,
sherpa-onnx + vibevoice setup), cross-link from audio-to-text.md, entry
in whats-new.md.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
* fix(diarization): correct sherpa-onnx symbol name + lint cleanup
CI failures on #9654:
* sherpa-onnx-grpc-{tts,transcription} and sherpa-onnx-realtime panicked
at backend startup with `undefined symbol: SherpaOnnxDestroyOfflineSpeakerDiarizationResult`.
Upstream's actual symbol is SherpaOnnxOfflineSpeakerDiarizationDestroyResult
(Destroy in the middle, not the prefix); the rest of the diarization
surface follows the same naming pattern. The mismatched name made
purego.RegisterLibFunc fail at dlopen time and crashed the gRPC server
before the BeforeAll could probe Health, taking down every sherpa-onnx
test job — not just the diarization-related ones.
* golangci-lint flagged 5 errcheck violations on new defer cleanups
(os.RemoveAll / Close / conn.Close); wrap each in a `defer func() { _ = X() }()`
closure (matches the pattern other LocalAI files use for new code, since
pre-existing bare defers are grandfathered in via new-from-merge-base).
* golangci-lint also flagged forbidigo violations: the new
diarization_test.go files used testing.T-style `t.Errorf` / `t.Fatalf`,
which are forbidden by the project's coding-style policy
(.agents/coding-style.md). Convert both files to Ginkgo/Gomega
Describe/It with Expect(...) — they get picked up by the existing
TestBackend / TestOpenAI suites, no new suite plumbing needed.
* modernize linter: tightened the diarization segment loop to
`for i := range int(numSegments)` (Go 1.22+ idiom).
Verified locally: golangci-lint with new-from-merge-base=origin/master
reports 0 issues across all touched packages, and the four mocked
diarization e2e specs in tests/e2e/mock_backend_test.go still pass.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
* fix(vibevoice-cpp): convert non-WAV input via ffmpeg + raise ASR token budget
Confirmed end-to-end against a real LocalAI instance with vibevoice-asr-q4_k
loaded and the multi-speaker MP3 sample at vibevoice.cpp/samples/2p_argument.mp3:
both /v1/audio/transcriptions and /v1/audio/diarization now succeed and
return correctly attributed speaker turns for the full clip.
Two latent issues surfaced once the diarization endpoint actually exercised
the backend with a non-trivial input:
1. vv_capi_asr only accepts WAV via load_wav_24k_mono. The previous code
passed the uploaded path straight through, so anything that wasn't
already a 24 kHz mono s16le WAV failed at the C side with rc=-8 and
the very unhelpful "vv_capi_asr failed". prepareWavInput shells out
to ffmpeg ("-ar 24000 -ac 1 -acodec pcm_s16le") in a per-call temp
dir, matching the rate the model was trained on; both AudioTranscription
and Diarize now route through it. This is the same shape sherpa-onnx
uses (utils.AudioToWav), but vibevoice needs 24 kHz rather than 16 kHz
so we don't reuse that helper.
2. The C ABI's max_new_tokens defaults to 256 when 0 is passed. That's
fine for a five-second clip but not for anything past ~10 s — vibevoice
stops mid-JSON, the parse fails, and the caller sees a hard error.
Pass a much larger budget (16 384 ≈ ~9 minutes of speech at the
model's ~30 tok/s rate); generation stops at EOS so this is a cap
rather than a target.
3. As a defensive belt-and-braces, mirror AudioTranscription's existing
"fall back to a single segment if the model emits non-JSON text"
pattern in Diarize, so partial / unusual model output never produces
a 500. This kept the endpoint usable while diagnosing (1) and (2),
and is the right behaviour to keep.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
* fix(vibevoice-cpp): pass valid WAVs through directly so ffmpeg is not required at runtime
Spotted by tests-e2e-backend (1.25.x): the previous fix forced every
incoming audio file through `ffmpeg -ar 24000 ...`, which meant the
backend container — which does not ship ffmpeg — failed even for the
existing happy path where the caller already uploads a WAV. The
container-side error was:
rpc error: code = Unknown desc = vibevoice-cpp: ffmpeg convert to
24k mono wav: exec: "ffmpeg": executable file not found in $PATH
Reading vibevoice.cpp's audio_io.cpp, `load_wav_24k_mono` uses drwav and
already accepts any PCM/IEEE-float WAV at any sample rate, downmixes
multi-channel input to mono, and resamples to 24 kHz internally. So the
only inputs that genuinely need an external converter are non-WAV
formats (MP3, OGG, FLAC, ...).
Detect WAVs by RIFF/WAVE magic at bytes 0..3 / 8..11 and pass them
straight through with a no-op cleanup; everything else still goes
through ffmpeg with the same 24 kHz mono s16le target. The result:
* Container builds without ffmpeg keep working for WAV uploads
(the e2e-backends fixture is jfk.wav at 16 kHz mono s16le).
* MP3 and other non-WAV inputs still get the new ffmpeg conversion
path so the diarization endpoint stays useful.
* If the caller uploads a non-WAV but ffmpeg isn't on PATH, the
surfaced error is still descriptive enough to act on.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
* fix(ci): make gcc-14 install in Dockerfile.golang best-effort for jammy bases
The LocalVQE PR (
|
||
|
|
170d55c67d |
fix(distributed): honor NodeSelector in cached-replica lookup, stop empty-backend reconciler scaleups (#9652)
* fix(distributed): honor NodeSelector in cached-replica lookup, stop empty-backend reconciler scaleups
Two distinct bugs were causing tight retry loops in the distributed scheduler:
1. FindAndLockNodeWithModel ignored the model's NodeSelector. When a model
was loaded on multiple nodes and only some matched the current selector,
the function returned the lowest-in_flight node — even one the selector
excluded. Route()'s post-check then fell through to scheduleNewModel,
which targeted the matching node where the model was already at
MaxReplicasPerModel capacity. Eviction couldn't help (the only loaded
model on that node was the one being requested, and it was busy), so
every request looped through "evicting LRU" → "all models busy".
Fix: thread an optional candidateNodeIDs filter through
FindAndLockNodeWithModel. Route() resolves the selector once via a new
resolveSelectorCandidates helper and passes the matching IDs to both
the cached-replica lookup and scheduleNewModel. The same helper
replaces the inline selector block in scheduleNewModel.
2. ScheduleAndLoadModel (reconciler scale-up path) fell back to
scheduleNewModel with backendType="" when no replica had ever been
loaded for a model. The worker rejected the resulting backend.install
("backend name is empty") on every reconciler tick (~30s).
Fix: remove the broken fallback. When GetModelLoadInfo has nothing
stored, return a clear error instead of firing a doomed NATS install.
The reconciler's existing scale-up failure log surfaces it once per
tick; the model auto-replicates as soon as Route() serves it once and
stores load info.
Also downgrade the post-LoadModel-failure StopGRPC error to Debug — that
cleanup attempt usually hits "model not found" because LoadModel failed
before registering the process, and the outer "Failed to load model"
error already carries the real reason.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]
* test(distributed): cover selector-aware FindAndLockNodeWithModel and reconciler scaleup guard
Two regression tests for the bugs fixed in the previous commit:
1. FindAndLockNodeWithModel — registry-level integration tests verify the
candidateNodeIDs filter:
- Returns the included node even when an excluded node has lower
in_flight (the original selector-mismatch loop scenario).
- Returns not-found when the model is loaded only on excluded nodes,
forcing Route() to fall through to a fresh schedule instead of
reusing the excluded replica.
2. ScheduleAndLoadModel — mock-based test verifies the reconciler scale-up
path returns an error and does NOT fire backend.install when no replica
has been loaded yet. fakeUnloader gains an installCalls slice so this
negative assertion is direct.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
6b63b47f61 |
feat(distributed): support multiple replicas of one model on the same node (#9583)
* feat(distributed): support multiple replicas of one model on the same node The distributed scheduler implicitly assumed `(node_id, model_name)` was unique, but the schema didn't enforce it and the worker keyed all gRPC processes by model name alone. With `MinReplicas=2` against a single worker, the reconciler "scaled up" every 30s but the registry never advanced past 1 row — the worker re-loaded the model in-place every tick until VRAM fragmented and the gRPC process died. This change introduces multi-replica-per-node as a first-class concept, with capacity-aware scheduling, a circuit breaker, and VRAM soft-reservation. Operators can declare per-node capacity via the worker flag `--max-replicas-per-model` (mirrored as auto-label `node.replica-slots=N`) or override per-node from the UI. * Schema: BackendNode gains MaxReplicasPerModel (default 1) and ReservedVRAM. NodeModel gains ReplicaIndex (composite with node_id + model_name). ModelSchedulingConfig gains UnsatisfiableUntil/Ticks for the reconciler circuit breaker. * Registry: replica_index threaded through SetNodeModel, RemoveNodeModel, IncrementInFlight, DecrementInFlight, TouchNodeModel, GetNodeModel, SetNodeModelLoadInfo and the InFlightTrackingClient. New helpers: CountReplicasOnNode, NextFreeReplicaIndex (with ErrNoFreeSlot), RemoveAllNodeModelReplicas, FindNodesWithFreeSlot, ClusterCapacityForModel, ReserveVRAM/ReleaseVRAM (atomic UPDATE with ErrInsufficientVRAM), and the unsatisfiable-flag CRUD. * Worker: processKey now `<modelID>#<replicaIndex>` so concurrent loads of the same model land on distinct ports. Adds CLI flag --max-replicas-per-model (env LOCALAI_MAX_REPLICAS_PER_MODEL, default 1) and emits the auto-label. * Router: scheduleNewModel filters candidates by free slot, allocates the replica index, and soft-reserves VRAM before installing the backend. evictLRUAndFreeNode now deletes the targeted row by ID instead of all replicas of the model on the node — fixes a latent bug where evicting one replica orphaned its siblings. * Reconciler: caps scale-up at ClusterCapacityForModel so a misconfig (MinReplicas > capacity) doesn't loop forever. After 3 consecutive ticks of capacity==0 it sets UnsatisfiableUntil for a 5m cooldown and emits a warning. ClearAllUnsatisfiable fires from Register, ApproveNode, SetNodeLabel(s), RemoveNodeLabel and UpdateMaxReplicasPerModel so a new node joining or label changes wake the reconciler immediately. scaleDownIdle removes highest-replica-index first to keep slots compact. * Heartbeat resets reserved_vram to 0 — worker is the source of truth for actual free VRAM; the reservation is only for the in-tick race window between two scheduling decisions. * Probe path (reconciler.probeLoadedModels and health.doCheckAll) now pass the row's replica_index to RemoveNodeModel so an unreachable replica doesn't orphan healthy siblings. * Admin override: PUT /api/nodes/:id/max-replicas-per-model sets a sticky override (preserved across worker re-registration). DELETE clears the override so the worker's flag applies again on next register. Required because Kong defaults the worker flag to 1, so every worker restart would have silently reverted the UI value. * React UI: always-visible slot badge on the node row (muted at default 1, accented when >1); inline editor in the expanded drawer with pencil-to-edit, Save/Cancel, Esc/Enter, "(override)" indicator when the value is admin-set, and a "Reset" button to hand control back to the worker. Soft confirm when shrinking the cap below the count of loaded replicas. Scheduling rules table gets an "Unsatisfiable until HH:MM" status badge surfacing the cooldown. * node.replica-slots filtered out of the labels strip on the row to avoid duplicating the slot badge. 23 new Ginkgo specs (registry, reconciler, inflight, health) cover: multi-replica row independence, RemoveNodeModel of one replica preserving siblings, NextFreeReplicaIndex slot allocation including ErrNoFreeSlot, capacity-gated scale-up with circuit breaker tripping and recovery on Register, scheduleDownIdle ordering, ClusterCapacity math, ReserveVRAM admission gating, Heartbeat reset, override survival across worker re-registration, and ResetMaxReplicasPerModel handing control back. Plus 8 stdlib tests for the worker processKey / CLI / auto-label. Closes the flap reproduced on Qwen3.6-35B against the nvidia-thor worker (single 128 GiB node, MinReplicas=2): the reconciler now caps the scale-up at the cluster's actual capacity instead of looping. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:opus-4-7 [Read] [Edit] [Bash] [Skill:critique] [Skill:audit] [Skill:polish] [Skill:golang-testing] * refactor(react-ui/nodes): tighten capacity editor copy + adopt ActionMenu for row actions * Capacity editor hint trimmed from operator-doc-style ("Sourced from the worker's `--max-replicas-per-model` flag. Changing it here makes it a sticky admin override that survives worker restarts." → "Saved values stick across worker restarts.") and the override-state copy similarly compressed. The full mechanic is no longer needed in the UI — the override pill carries the meaning and the docs cover the rest. * Node row actions migrated from an inline cluster of icon buttons (Drain / Resume / Trash) to the kebab ActionMenu used by /manage for per-row model actions, so dense Nodes tables stay clean. Approve stays as a prominent primary button — it's a stateful admission gate, not a routine action, and elevating it matches how /manage surfaces install-time decisions outside the menu. * The expanded drawer's Labels section now filters node.replica-slots out of the editable label list. The label is owned by the Capacity editor above; surfacing it again as an editable label invited confusion (the Capacity save would clobber any direct edit). Both backend and agent workers benefit — they share the row rendering path, so the action menu and label filter apply to both. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:opus-4-7 [Edit] [chrome-devtools-mcp] [Skill:critique] [Skill:audit] [Skill:polish] * fix(react-ui/nodes): suppress slot badge on agent workers Agent workers don't load models, so the per-node replica capacity is inapplicable to them. Showing "1× slots" on agent rows was a tiny inconsistency from the unified rendering path — gate the badge on node_type !== 'agent' so it only appears on backend workers. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:opus-4-7 [Edit] [chrome-devtools-mcp] * refactor(react-ui/nodes): distill expanded drawer + restyle scheduling form The expanded node drawer used to stack five panels — slot badge, filled capacity box, Loaded Models h4+empty-state, Installed Backends h4+empty-state, Labels h4+chips+form — making routine inspections feel like a control panel. The scheduling rule form wrapped its mode toggle as two 50%-width filled buttons that competed visually with the actual primary action. * Drawer: collapse three rarely-touched config zones (Capacity, Backends, Labels) into one `<details>` "Manage" disclosure (closed by default) with small uppercase eyebrow labels for each zone instead of parallel h4 sub-headings. Loaded Models stays as the at-a-glance headline with a single-line empty hint instead of a boxed empty state. CapacityEditor renders flat (no filled background) — the Manage disclosure provides framing. * Scheduling form: replace the chunky 50%-width button-tabs with the project's existing `.segmented` control (icon + label, sized to content). Mode hint becomes a single tied line below. Fields stack vertically with helper text under inputs and a hairline divider above the right-aligned Save / Cancel. The empty drawer collapses from ~5 stacked sections (~280px tall) to two lines (~80px). The scheduling form now reads as a designed dialog instead of raw building blocks. Both surfaces now match the typographic density and weight of the rest of the admin pages. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:opus-4-7 [Edit] [chrome-devtools-mcp] [Skill:distill] [Skill:audit] [Skill:polish] * feat(react-ui/nodes): replace scheduling form's model picker with searchable combobox The native <select> made operators scroll through every gallery entry to find a model name. The project already has SearchableModelSelect (used in Studio/Talk/etc.) which combines free-text search with the gallery list and accepts typed model names that aren't installed yet — useful for pre-staging a scheduling rule before the node it'll run on has finished bootstrapping. Also drops the now-unused useModels import (the combobox manages the gallery hook internally). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:opus-4-7 [Edit] * refactor(react-ui/nodes): consolidate key/value chip editor + add replica preset chips The Nodes page was rendering the same key=value chip pattern in two places with subtly different markup: the Labels editor in the expanded drawer and (post-distill) the Node Selector input in the scheduling form. The form's input was also a comma-separated string that operators were getting wrong. * Extract <KeyValueChips> as a fully controlled chip-builder. Parent owns the map and decides what onAdd/onRemove does — form state for the scheduling form, API calls for the live drawer Labels editor. Same visuals everywhere; one component to change when polish needs apply. * Replace the comma-separated Node Selector text input with KeyValueChips. Operators were copying syntax from docs and missing commas; the chip vocabulary makes the key=value structure self-documenting. * Add <ReplicaInput>: numeric input + quick-pick preset chips for Min/Max replicas. Picked over a slider because replica counts are exact specs derived from VRAM math (operator decision, not a fuzzy estimate). The chips give one-click access to common values (1/2/3/4 for Min, 0=no-limit/2/4/8 for Max) without the slider's special-value problem (MaxReplicas=0 is categorical, not a position on a continuum). * Drop the now-unused labelInputs state in the Nodes page (the inline label editor's per-node draft state lived there and is now owned by KeyValueChips). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:opus-4-7 [Edit] [Skill:distill] * test: fix CI fallout from multi-replica refactor (e2e/distributed + playwright) Two breakages caught by CI that didn't surface in the local run: * tests/e2e/distributed/*.go — multiple files used the pre-PR2 registry signatures for SetNodeModel / IncrementInFlight / DecrementInFlight / RemoveNodeModel / TouchNodeModel / GetNodeModel / SetNodeModelLoadInfo and one stale adapter.InstallBackend call in node_lifecycle_test.go. All updated to pass replicaIndex=0 — these tests don't exercise multi-replica behavior, they just need to compile against the new signatures. The chip-builder tests in core/services/nodes/ already cover the multi-replica logic. * core/http/react-ui/e2e/nodes-per-node-backend-actions.spec.js — the drawer's distill refactor moved Backends inside a "Manage" <details> disclosure that's collapsed by default. The test helper expanded the node row but never opened Manage, so the per-node backend table was never in the DOM. Helper now clicks `.node-manage > summary` after expanding the row. All 100 playwright tests pass locally; tests/e2e/distributed compiles clean. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:opus-4-7 [Edit] [Bash] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
13734ae9fa |
feat: Add Sherpa ONNX backend for ASR and TTS (#8523)
feat(backend): Add Sherpa ONNX backend and Omnilingual ASR Adds a new Go backend wrapping sherpa-onnx via purego (no cgo). Same approach as opus/stablediffusion-ggml/whisper — a thin C shim (csrc/shim.c + shim.h → libsherpa-shim.so) wraps the bits purego can't reach directly: nested struct config writes, result-struct field reads, and the streaming TTS callback trampoline. The Go side uses opaque uintptr handles and purego.NewCallback for the TTS callback. Supports: - VAD via sherpa-onnx's Silero VAD - Offline ASR: Whisper, Paraformer, SenseVoice, Omnilingual CTC - Online/streaming ASR: zipformer transducer with endpoint detection (AudioTranscriptionStream emits delta events during decode) - Offline TTS: VITS (LJS, etc.) - Streaming TTS: sherpa-onnx's callback API → PCM chunks on a channel, prefixed by a streaming WAV header Gallery entries: omnilingual-0.3b-ctc-q8-sherpa (1600-language offline ASR), streaming-zipformer-en-sherpa (low-latency streaming ASR), silero-vad-sherpa, vits-ljs-sherpa. E2E coverage: tests/e2e-backends for offline + streaming ASR, tests/e2e for the full realtime pipeline (VAD + STT + TTS). Assisted-by: claude-opus-4-7-1M [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> |
||
|
|
c54897ad44 |
fix(tests): update InstallBackend call sites for new URI/Name/Alias params (#9467)
Commit
|
||
|
|
e1a6010874 |
fix(streaming): deduplicate tool call emissions during streaming (#9292)
The Go-side incremental JSON parser was emitting the same tool call on
every streaming token because it lacked the len > lastEmittedCount guard
that the XML parser had. On top of that, the post-streaming default:
case re-emitted all tool calls from index 0, duplicating everything.
This produced duplicate delta.tool_calls events causing clients to
accumulate arguments as "{args}{args}" — invalid JSON.
Fixes:
- JSON incremental parser: add len(jsonResults) > lastEmittedCount guard
and loop from lastEmittedCount (matching the XML parser pattern)
- Post-streaming default: case: skip i < lastEmittedCount entries that
were already emitted during streaming
- JSON parser: use blocking channel send (matching XML parser behavior)
|
||
|
|
13a6ed709c |
fix: thinking models with tools returning empty content (reasoning-only retry loop) (#9290)
When clients like Nextcloud or Home Assistant send requests with tools to thinking models (e.g. Gemma 4 with <|channel>thought tags), the response was empty despite the backend producing valid content. Root cause: the C++ autoparser puts clean content in both the raw Response and ChatDeltas. The Go-side PrependThinkingTokenIfNeeded then prepends the thinking start token to the already-clean content, causing ExtractReasoning to classify the entire response as unclosed reasoning. This made cbRawResult empty, triggering a retry loop that never succeeds. Two fixes: - inference.go: check ChatDeltas for content/tool_calls regardless of whether Response is empty, so skipCallerRetry fires correctly - chat.go: when ChatDeltas have content but no tool calls, use that content directly instead of falling back to the empty cbRawResult |
||
|
|
85be4ff03c |
feat(api): add ollama compatibility (#9284)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
0f9d516a6c |
fix(anthropic): do not emit empty tokens and fix SSE tool calls (#9258)
This fixes Claude Code compatibility Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
773489eeb1 |
fix(chat): do not retry if we had chatdeltas or tooldeltas from backend (#9244)
* fix(chat): do not retry if we had chatdeltas or tooldeltas from backend Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: use oai compat for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: apply to non-streaming path too Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * map also other fields Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
6b6c136210 |
fix(inflight): count inflight from load model, but release afterwards (#9194)
This should fix the count of 1 in flight always showing in the node list Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
59108fbe32 |
feat: add distributed mode (#9124)
* feat: add distributed mode (experimental) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix data races, mutexes, transactions Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactorings Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix events and tool stream in agent chat Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * use ginkgo Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(cron): compute correctly time boundaries avoiding re-triggering Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * enhancements, refactorings Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * do not flood of healthy checks Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * do not list obvious backends as text backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * tests fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop redundant healthcheck Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * enhancements, refactorings Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
8cd3f9fc47 |
feat(ui, openai): Structured errors and link to traces in error toast (#9068)
First when sending errors over SSE we now clearly identify them as such instead of just sending the error string as a chat completion message. We use this in the UI to identify errors and link to them to the traces. Signed-off-by: Richard Palethorpe <io@richiejp.com> |
||
|
|
f9a850c02a |
feat(realtime): WebRTC support (#8790)
* feat(realtime): WebRTC support Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(tracing): Show full LLM opts and deltas Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com> |
||
|
|
8818452d85 |
feat(ui): MCP Apps, mcp streaming and client-side support (#8947)
* Revert "fix: Add timeout-based wait for model deletion completion (#8756)"
This reverts commit
|
||
|
|
96efa4fce0 |
feat: add WebSocket mode support for the response api (#8676)
* feat: add WebSocket mode support for the response api Signed-off-by: bittoby <218712309+bittoby@users.noreply.github.com> * test: add e2e tests for WebSocket Responses API Signed-off-by: bittoby <218712309+bittoby@users.noreply.github.com> --------- Signed-off-by: bittoby <218712309+bittoby@users.noreply.github.com> |
||
|
|
697f6aa71c |
feat(audio): set audio content type (#8416)
* feat(audio): set audio content type Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore: add tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
53276d28e7 |
feat(musicgen): add ace-step and UI interface (#8396)
* feat(musicgen): add ace-step and UI interface Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly handle model dir Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop auto-download Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to models, fixup UIs icons Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Update docs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * l4t13 is incompatbile Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * avoid pinning version for cuda12 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop l4t12 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
b6459ddd57 |
feat(api): Add transcribe response format request parameter & adjust STT backends (#8318)
* WIP response format implementation for audio transcriptions (cherry picked from commit e271dd764bbc13846accf3beb8b6522153aa276f) Signed-off-by: Andres Smith <andressmithdev@pm.me> * Rework transcript response_format and add more formats (cherry picked from commit 6a93a8f63e2ee5726bca2980b0c9cf4ef8b7aeb8) Signed-off-by: Andres Smith <andressmithdev@pm.me> * Add test and replace go-openai package with official openai go client (cherry picked from commit f25d1a04e46526429c89db4c739e1e65942ca893) Signed-off-by: Andres Smith <andressmithdev@pm.me> * Fix faster-whisper backend and refactor transcription formatting to also work on CLI Signed-off-by: Andres Smith <andressmithdev@pm.me> (cherry picked from commit 69a93977d5e113eb7172bd85a0f918592d3d2168) Signed-off-by: Andres Smith <andressmithdev@pm.me> --------- Signed-off-by: Andres Smith <andressmithdev@pm.me> Co-authored-by: nanoandrew4 <nanoandrew4@gmail.com> Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com> |
||
|
|
4077aaf978 |
chore: re-enable e2e tests, fixups anthropic API tools support (#8296)
* chore(tests): add mock backend e2e tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup anthropic tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * prepare e2e tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop repetitive tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop specific CI workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixup anthropic issues, move all e2e tests to use mocked backend Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
5ca8f0aea0 |
feat: add tool/function calling support to Anthropic Messages API (#7956)
* Initial plan * Add tool/function calling schema support to Anthropic Messages API Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> * Add E2E tests for Anthropic tool calling Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> * Make tool calling tests require model to use tools - First test now expects hasToolUse to be true with clear error message - Third test now expects toolUseID to be non-empty (removed conditional) - Both tests will now fail if model doesn't call the expected tools Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> * Add E2E test for tool calling with streaming responses - Tests that streaming events are properly emitted (content_block_start/delta/stop) - Verifies tool_use blocks are accumulated correctly in streaming mode - Ensures model calls tools and stop_reason is set to tool_use Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> |
||
|
|
4cbf9abfef |
feat: Add Anthropic Messages API support (#7948)
* Initial plan * Add Anthropic Messages API support Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> * Fix code review comments: add error handling for JSON operations Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> * Fix test suite to use existing schema test runner Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> * Add Anthropic e2e tests using anthropic-sdk-go for streaming and non-streaming Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> * Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> |
||
|
|
432513c3ba |
ci: add GPU tests (#1095)
* ci: test GPU Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: show logs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Debug * debug Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * split extra/core images Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * split extra/core images Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * consider runner host dir Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |