LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-05-30 11:36:31 -04:00

Author	SHA1	Message	Date
LocalAI [bot]	a891eedd08	fix(distributed): persist per-model load info so reconciler survives frontend restart (#9981 ) * feat(distributed): add per-model ModelLoadInfo persistence Adds a dedicated ModelLoadInfo table keyed by model name, decoupled from the per-replica NodeModel rows. The reconciler can now recover model load metadata after every NodeModel row has been removed (worker death, eviction, MarkOffline reaping, frontend restart with stale heartbeats), which is the read side of Bug-1 from the distributed mode bug hunt. Registry exposes: - UpsertModelLoadInfo: ON CONFLICT (model_name) update; last-write-wins, matching the existing per-replica blob semantics under concurrent multi-frontend dispatch. - GetModelLoadInfo: read from the new table first; fall back to the legacy NodeModel-blob scan for rows written before any frontend in the cluster ran an UpsertModelLoadInfo (rolling-upgrade transition). SetNodeModelLoadInfo (per-replica blob) is preserved for backward compatibility and per-replica diagnostics; the dispatch-path hook in the next commit calls both. The new table joins the existing nodes AutoMigrate set under the same schema-migration advisory lock. Refs: Bug-1, docs/superpowers/specs/2026-05-24-distributed-mode-bug-hunt-findings.md Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7[1m] * fix(distributed): persist per-model load info on dispatch scheduleAndLoad now writes the (backendType, ModelOptions blob) pair to the new ModelLoadInfo table in addition to the existing per-replica NodeModel.model_opts_blob field. The per-replica blob still works for the hot path; the per-model row outlives every NodeModel row going away, which is what unblocks the reconciler on the read side. Both writes are best-effort with warn-level logging on failure: a write miss here just means the reconciler may need a fresh inference request to repopulate, which is the pre-fix behavior. Concurrency: two frontends loading the same model at the same time both fire UpsertModelLoadInfo; ON CONFLICT (model_name) makes the row converge to whichever commits last. Matches the existing per-replica blob semantics. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7[1m] * test(distributed): cover load info persistence and Bug-1 recovery Adds Ginkgo specs that prove the persistence layer behaves correctly and that the reconciler actually recovers from the frontend-restart scenario that was failing in production: registry_test.go: - per-model row survives RemoveAllNodeModelReplicas (the bug repro) - ON CONFLICT (model_name) updates backend type + blob, last-write-wins - legacy NodeModel-blob fallback still works (rolling-upgrade transition) - GetModelLoadInfo returns ErrRecordNotFound when both sources are empty - UpsertModelLoadInfo rejects empty model names reconciler_test.go: - Bug-1 end-to-end: with min_replicas=2, no NodeModel rows, but a ModelLoadInfo row present, one reconcile tick fires two scheduler calls. Pre-fix this returned "no load info" and the scheduler never got called until a fresh inference request arrived. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7[1m] * docs(distributed): note restart-safe reconciler behavior Adds a bullet to the Replica Reconciler section explaining that per-model load metadata is persisted across frontend restarts via the new model_load_infos PostgreSQL table, so a rolling upgrade no longer needs a fresh inference request per model before the reconciler can replace dead replicas. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7[1m] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-25 13:00:06 +02:00
LocalAI [bot]	06e777b75e	feat(distributed): gated X-LocalAI-Node response header (middleware + wrapper) (#9976 ) * feat(distributed): add per-request node ID context holder Introduce pkg/distributedhdr, a leaf package carrying a per-request atomic.Value holder for the picked worker node ID from the SmartRouter (core/services/nodes) up to the HTTP response writer wrapper (core/http/middleware). Avoids the import cycle that a shared key in either consumer would create. Exposes NewHolder, WithHolder, Holder, Stamp, Load, Inherit. The holder is atomic.Value so cross-goroutine publish from the router to the response writer wrapper is race-clean. Assisted-by: Claude:claude-opus-4-7[1m] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> feat(distributed): add ExposeNodeHeader middleware + response writer wrapper New ApplicationConfig.ExposeNodeHeader bool + --expose-node-header CLI flag / LOCALAI_EXPOSE_NODE_HEADER env var (default off; the node ID reveals internal topology and is opt-in). The middleware creates a per-request atomic.Value holder, attaches it to c.Request().Context() via distributedhdr.WithHolder, and wraps c.Response().Writer with a custom http.ResponseWriter that sets the X-LocalAI-Node header on first Write / WriteHeader / Flush by reading the holder. Implements http.Flusher, http.Hijacker, Unwrap so it composes cleanly with Echo and http.NewResponseController. request.go propagates the holder onto derived contexts via distributedhdr.Inherit so the holder survives the correlation-ID context replacement. Unit + race-clean concurrency + integration specs. Assisted-by: Claude:claude-opus-4-7[1m] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> feat(distributed): stamp node ID in router and wire middleware to inference routes ModelRouterAdapter.Route stamps the picked node ID into the per-request holder via distributedhdr.Stamp(ctx, result.Node.ID) right after replica selection. Wire ExposeNodeHeader middleware to: - OpenAI chat/completion/embeddings + audio transcriptions/speech + image generations/inpainting - Anthropic /v1/messages - Ollama /api/chat, /api/generate, /api/embed, /api/embeddings - Jina /v1/rerank - LocalAI /v1/vad The middleware's wrapper reads the holder on first byte and sets the X-LocalAI-Node response header before delegating to the underlying writer. Per-request scope means no race under concurrent multi-replica routing. Assisted-by: Claude:claude-opus-4-7[1m] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(distributed): thread request context through backend Load + cover ctx propagation Five non-OpenAI backend helpers were silently using app.Context instead of the request context for the gRPC backend call: transcription, TTS, image generation, rerank, VAD. Effect: distributedhdr.Stamp in the router callback was a silent no-op for these paths, AND client cancellation didn't propagate to in-flight inference. Thread c.Request().Context() (or the equivalent input.Context after the request middleware has installed the correlation-ID derived context) through each helper and into ModelOptions via model.WithContext(ctx). ImageGeneration's signature gains a leading ctx parameter; in-tree callers (openai image, openai inpainting, openai inpainting_test) are updated to match. ModelEmbedding gains a leading ctx parameter for the same reason; the openai and ollama embedding handlers pass the request context through. chat_stream_workers.go defers the initial role=assistant chunk emission until the first token callback so the wrapper's lazy X-LocalAI-Node lookup against the loader runs AFTER ml.Load has stamped the per-modelID node ID; semantically identical for clients (role still arrives before any text). Regression test core/backend/ctx_propagation_test.go pins ctx propagation for all five helpers. Docs updated to enumerate the full endpoint coverage of the --expose-node-header flag. Assisted-by: Claude:claude-opus-4-7[1m] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-25 10:51:48 +02:00
LocalAI [bot]	a0f3e26245	fix(distributed): make admin backend installs resilient and observable (#9958 ) * feat(distributed): add configurable NATS backend install/upgrade timeouts Adds BackendInstallTimeout and BackendUpgradeTimeout to DistributedConfig with 15m defaults, following the existing MCPToolTimeout / WorkerWaitTimeout pattern. These will replace the hardcoded literals in RemoteUnloaderAdapter so admin-driven backend installs across the cluster survive long OCI image pulls that previously timed out at 3m. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * style(distributed): gofmt alignment after timeout fields Re-aligns the Validate() negative-duration map and the Default* const block so the new BackendInstall/UpgradeTimeout entries do not leave the surrounding columns mis-padded. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(cli): surface LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT and _UPGRADE_TIMEOUT Parses the two new env vars on the run CLI and threads them through the existing AppOption builder so DistributedConfig picks them up. Invalid duration strings now fail loudly at startup rather than silently falling back to the default. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): inject NATS install/upgrade timeouts into RemoteUnloaderAdapter Removes the hardcoded 3m / 15m literals from RemoteUnloaderAdapter and threads in DistributedConfig.BackendInstallTimeoutOrDefault() and BackendUpgradeTimeoutOrDefault() at construction. Install now defaults to 15m (was 3m); cold OCI image pulls on Jetson Wi-Fi routinely blew past the old ceiling. Scripted messaging client captures the timeout so tests can assert the configured value actually reaches the NATS request. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): introduce galleryop.ErrWorkerStillInstalling sentinel When the NATS request-reply for backend.install (or .upgrade) times out the worker is almost always still pulling the OCI image. Wrap the timeout in a typed sentinel so the manager above can distinguish "worker hung" from "worker still working" and leave the pending_backend_ops row in place for the reconciler to confirm via backend.list. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): treat NATS install timeout as in-progress, not failure When a worker times out replying to backend.install but the install is still running on the worker, enqueueAndDrainBackendOp now reports a running_on_worker status and pushes NextRetryAt out by the install timeout so the reconciler does not immediately re-fire another install while the worker is still pulling the image. The pending_backend_ops row stays in place for the next reconciler pass to confirm via backend.list. InstallBackend wraps the result in galleryop.ErrWorkerStillInstalling so callers can branch (galleryop renders yellow in-progress instead of red error). UpgradeBackend uses the same wrap. Adds RemoteUnloaderAdapter.InstallTimeout() so the manager can push NextRetryAt by the configured timeout without reaching into a private field, and NodeRegistry.RecordPendingBackendOpInFlight as the soft cousin of RecordPendingBackendOpFailure. Also includes incidental gofmt-driven struct-field alignment in registry.go on lines unrelated to the change (touched files are re-formatted to canonical form per project policy). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(distributed): don't increment Attempts on in-flight install timeout An in-flight timeout (worker still pulling the OCI image) is not a failed attempt, it's a delayed one. Incrementing Attempts let genuinely-progressing slow installs (e.g. 30 GB CUDA images on Wi-Fi) trip the reconciler's maxPendingBackendOpAttempts cap and dead-letter the queue row while the worker was still legitimately working. RecordPendingBackendOpInFlight now only updates LastError and NextRetryAt. Also documents "running_on_worker" in the NodeOpStatus.Status enum comment so Task 6 implementers see the full surface. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(galleryop): surface ErrWorkerStillInstalling as non-error OpStatus When the distributed backend manager returns an error that wraps ErrWorkerStillInstalling, backendHandler now completes the op with a "still installing in background" message rather than marking it as a red failure. Admin UI sees a yellow in-progress state; reconciler confirms completion on its next pass. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(distributed): end-to-end install-timeout-then-reconcile Wires Task 1-6 end-to-end so any seam mismatch surfaces in CI rather than during a real cluster install. NATS times out, the queue row stays alive with running_on_worker status, the worker eventually reports the backend installed via backend.list, the manager surfaces it via ListBackends. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(distributed): document LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT / _UPGRADE_TIMEOUT Add the two new operator-tunable env vars to the Frontend Configuration table in the distributed-mode docs. Explains the 15m default, when to raise it (slow links pulling multi-GB OCI images), and the new "still installing in background" admin-UI state when the round-trip times out but the worker is still working. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): clear pending install rows when backend.list confirms DistributedBackendManager.ListBackends now proactively clears pending_backend_ops install rows whose (nodeID, backend) is reported installed by backend.list. Operator UI updates immediately instead of waiting up to installTimeout (default 15m) for the next reconciler tick after NextRetryAt. Only install rows are cleared; upgrade and delete intents are not satisfied by presence in backend.list and continue to drain through their normal reconciler paths. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(messaging): add BackendInstallProgressEvent wire type and subject New NATS subject nodes.<nodeID>.backend.install.<opID>.progress lets the worker publish transient progress events (file, current/total bytes, percentage, phase) while a long-running install pulls its OCI image. BackendInstallRequest gains an optional OpID field so the worker knows which subject to publish on. Transient pub/sub (not JetStream): the install reply remains ground truth for success/failure; dropped progress events are tolerable. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * style(messaging): drop em-dash from BackendInstallProgress test comment Per project convention (no em-dashes anywhere). Comment substance is unchanged. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): worker publishes debounced install progress over NATS When BackendInstallRequest.OpID is set, the worker's backend.install handler wires a debounced publisher (250ms window) into the gallery download callback. Each tick becomes a BackendInstallProgressEvent on nodes.<nodeID>.backend.install.<opID>.progress; the publisher always emits a final event on Flush so the UI sees the terminal percentage. Old masters that do not set OpID continue to run silent installs: no behavior change for them. Lock ordering: the publisher releases its mutex before calling messaging.Publish so a slow network never stalls the install loop. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): RemoteUnloaderAdapter subscribes to install progress InstallBackend gains opID + onProgress parameters. When both are set, the adapter subscribes to nodes.<nodeID>.backend.install.<opID>.progress BEFORE publishing the install request, decodes each message into the caller's onProgress callback in a goroutine (so a slow callback never stalls the NATS reader thread), and unsubscribes after RequestJSON returns. When onProgress is nil OR opID is empty (the reconciler retry path), subscription is skipped entirely - silent installs cost nothing extra. Subscribe failure is logged at Warn and the install proceeds without progress streaming; the NATS round-trip still owns terminal status. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): forward backend install progress into galleryop OpStatus DistributedBackendManager.InstallBackend now passes the gallery op ID and a progress bridge into the adapter call. Each BackendInstallProgressEvent from the worker becomes a galleryop.ProgressCallback tick - which the existing backendHandler already turns into OpStatus.UpdateStatus, so the admin UI/SSE polling sees per-byte progress for distributed installs without any UI-side change. UpgradeBackend is intentionally left silent for now: its wire request (BackendUpgradeRequest) does not carry OpID, and rolling-update fallback is the rarer path. Will be picked up in a follow-up if the worker upgrade path also gets a progress channel. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(distributed): InstallBackend tolerates silent (pre-Phase-2) workers A worker on pre-Phase-2 code never publishes progress events. The new master subscribes optimistically; this spec pins that a silent worker still produces a green install with no progressCb ticks. The install reply is the source of truth for terminal state; the progress stream is a best-effort UX enrichment. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(distributed): document install progress streaming Note the new nodes.<nodeID>.backend.install.<opID>.progress subject and the silent-worker compatibility behavior so operators know to expect real-time progress and what happens on a mixed-version cluster. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(distributed): note progress-event ordering trade-off in InstallBackend Document near the goroutine dispatch why ordering at the consumer is best-effort, why it rarely matters in practice (worker debounce >> goroutine jitter), and what a future hardening pass would look like (Seq field + stale-by-seq drop). Stops the next reader from accidentally "fixing" the goroutine pool away. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(galleryop): add NodeProgress + OpStatus.Nodes for per-node breakdown Adds the data model the UI needs to render an expandable per-node breakdown of a fanned-out backend install. NodeProgress carries node identity (ID + name), per-node status (queued / running_on_worker / success / error / downloading), the current file + bytes + percentage from the Phase 2 progress stream, and any per-node error. OpStatus.Nodes is the slice the /api/operations handler will surface in a follow-up. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(galleryop): UpdateNodeProgress merges per-node ticks by NodeID GalleryService.UpdateNodeProgress(opID, nodeID, np) merges a NodeProgress into OpStatus.Nodes (keyed by NodeID, no duplicates) and mirrors the latest tick into the aggregate Progress / FileName / DownloadedFileSize / TotalFileSize fields so the legacy single-bar OperationsBar view keeps working unchanged alongside the new per-node breakdown. Concurrent-safe via the existing g.Mutex. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): write per-node OpStatus entries during install fan-out DistributedBackendManager now accepts a nodeProgressSink and feeds it two streams: 1. enqueueAndDrainBackendOp emits a per-node terminal entry on each status it appends to BackendOpResult (queued, success, error, running_on_worker). The opID is threaded through the function so the sink gets the right gallery op identity. 2. The install apply closure fans each BackendInstallProgressEvent into the sink as a downloading entry, alongside the legacy progressCb path so the aggregate single-bar view stays correct. Production wiring passes the GalleryService (which implements UpdateNodeProgress via Task 2) as the sink. Single-node tests pass nil. DeleteBackend and UpgradeBackend pass an empty opID so the sink path no-ops for ops that aren't gallery-tracked the same way as Install. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(operations): expose per-node breakdown on /api/operations When an operation's OpStatus has Nodes entries (populated by the Phase 4 progress sink wiring), surface them as a "nodes" array on the /api/operations response, sorted by node_name for stable rendering. Backward compatible: legacy clients ignore the field; ops without any node entries (single-node mode, model installs) omit the array entirely thanks to the empty-slice guard. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(ui): per-node breakdown in OperationsBar When an install op fans out to more than one worker, the operations bar now shows a "N nodes" chevron that expands into a per-node list. Each row carries the node's status (color-coded pill), the current file being downloaded, byte counts, percentage, and a thin per-node progress bar. Yellow "Worker busy" pill marks running_on_worker status with a tooltip explaining the NATS round-trip timed out but the worker is still installing in the background. Backward compatible: ops without a nodes field (legacy or single-node mode) render as before. State for expand/collapse is local to the component, keyed by jobID/id - reload starts collapsed. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(distributed): document per-node breakdown in the operations bar Adds a short subsection covering the expandable "N nodes" chevron in the OperationsBar admin UI, the meaning of each status pill, and how it relates to the /api/operations nodes array. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(galleryop): UpdateStatus preserves Nodes when caller sends none Real-world bug surfaced by the Phase 4 multi-worker smoke test: the nodes[] array in /api/operations flickered between a single node at a time on a 2-worker install. Root cause: the Phase 2 progress bridge also calls the legacy progressCb -> UpdateStatus(&OpStatus{...}) on every tick. UpdateStatus then overwrote the entire status pointer, wiping the Nodes slice that UpdateNodeProgress had just merged in. Fix: in UpdateStatus, if the incoming op has an empty Nodes slice, carry forward the previous status's Nodes before storing. Callers that explicitly populate Nodes still win (their slice replaces the prior one, no merge across the two code paths). Two regression specs added pinning both directions of the contract. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(distributed): strip implementation details from user-facing docs Trim the new install/upgrade timeout rows and the install-progress sections to focus on what the operator sees and tunes. Drops: - the NATS subject names and pub/sub mechanics - "round-trip" / reconciler / backend.list jargon - /api/operations polling cadence - "pre-2026-05-22" version references Reframes the breakdown text around the admin UI (Operations Bar, chevron, status pills, "Worker busy" tooltip). Implementation context lives in the agent notes and code comments. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(config): move DistributedConfig.Validate flag names to constants The negative-duration check map was a wall of literal kebab-case strings that had to stay in sync with the kong-derived CLI flag names manually. Move them to a Flag* const block alongside the existing Default* block so a rename of either the Go field or the CLI naming convention forces a compile error rather than silent drift. Sole consumer today is Validate; the constants are exported so future operator-facing surfaces (e.g. error messages on other validation paths) can reference them by name instead of repeating the literals. Tests pin both the literal values (so a future "let's just rename this" doesn't accidentally regress the CLI flag) and the negative- duration error message for the new BackendInstall / BackendUpgrade fields. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(distributed): extract NodeStatus and Phase enums to constants Sweep for the same literal-string-as-identifier pattern called out on the Validate flag names: the per-node install status enum ("queued" \| "downloading" \| "running_on_worker" \| "success" \| "error") appeared as raw literals across managers_distributed.go (10+ sites, including 3 separate `n.Status == "running_on_worker"` checks), operation.go, and the test suite. Same shape for the Phase enum ("resolving" \| "downloading" \| "extracting" \| "starting") in the worker-side progress publisher. Promote both to exported const blocks: - galleryop.NodeStatus{Queued,Downloading,RunningOnWorker,Success,Error} shared between galleryop.NodeProgress.Status (the wire field) and nodes.NodeOpStatus.Status (the in-process per-node summary) - messaging.Phase{Resolving,Downloading,Extracting,Starting} shared between the worker publisher and any future consumer that needs to switch on phase Tests pin both the literal values (so a future "let's just rename" doesn't silently change the JSON wire) and use the constants in setup (so the producer side stays drift-protected). Wire-format assertions on the /api/operations JSON output keep their literals deliberately, so the constant value can never silently diverge from what the UI receives. Out of scope for this PR (separate cleanup): the finetune and quantization job-status enums have the same anti-pattern with 14+ literal sites each, but predate this PR's work. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-23 12:35:44 +02:00
Richard Palethorpe	8e43842175	feat(vllm, distributed): tensor parallel distributed workers (#9612 ) * feat(vllm): build vllm from source for Intel XPU Upstream publishes no XPU wheels for vllm. The Intel profile was silently picking up a non-XPU wheel that imported but errored at engine init, and several runtime deps (pillow, charset-normalizer, chardet) were missing on Intel -- backend.py crashed at import time before the gRPC server came up. Switch the Intel profile to upstream's documented from-source procedure (docs/getting_started/installation/gpu.xpu.inc.md in vllm-project/vllm): - Bump portable Python to 3.12 -- vllm-xpu-kernels ships only a cp312 wheel. - Source /opt/intel/oneapi/setvars.sh so vllm's CMake build sees the dpcpp/sycl compiler from the oneapi-basekit base image. - Hide requirements-intel-after.txt during installRequirements (it used to 'pip install vllm'); install vllm's deps from a fresh git clone of vllm via 'uv pip install -r requirements/xpu.txt', swap stock triton for triton-xpu==3.7.0, then 'VLLM_TARGET_DEVICE=xpu uv pip install --no-deps .'. - requirements-intel.txt trimmed to LocalAI's direct deps (accelerate / transformers / bitsandbytes); torch-xpu, vllm, vllm_xpu_kernels and the rest come from upstream's xpu.txt during the source build. - requirements.txt: add pillow + charset-normalizer + chardet -- used by backend.py and missing on the Intel install profile. - run.sh: 'set -x' so backend startup is visible in container logs (the gRPC startup error path was previously opaque). Also adds a one-line docs example for engine_args.attention_backend under the vLLM section, since older XE-HPG GPUs (e.g. Arc A770) need TRITON_ATTN to bypass the cutlass path in vllm_xpu_kernels. Tested end-to-end on an Intel Arc A770 with Qwen2.5-0.5B-Instruct via LocalAI's /v1/chat/completions. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(vllm): add multi-node data-parallel follower worker vLLM v1's multi-node story is one process per node sharing a DP coordinator over ZMQ -- the head runs the API server with data_parallel_size > 1 and followers run `vllm serve --headless ...` with matching topology. Today LocalAI can already configure DP on the head via the engine_args YAML map, but there's no way to bring up the follower nodes -- so the head sits waiting for ranks that never handshake. Add `local-ai p2p-worker vllm`, mirroring MLXDistributed's structural precedent (operator-launched, static config, no NATS placement). The worker: - Optionally self-registers with the frontend as an agent-type node tagged `node.role=vllm-follower` so it's visible in the admin UI and operators can scope ordinary models away via inverse selectors. - Resolves the platform-specific vllm backend via the gallery's "vllm" meta-entry (cuda, intel-vllm, rocm-vllm, ...). - Runs vLLM as a child process so the heartbeat goroutine survives until vLLM exits; forwards SIGINT/SIGTERM so vLLM can clean up its ZMQ sockets before we tear down. - Validates --headless + --start-rank 0 is rejected (rank 0 is the head and must serve the API). Backend run.sh dispatches `serve` as the first arg to vllm's own CLI instead of LocalAI's backend.py gRPC server -- the follower speaks ZMQ directly to the head, there is no LocalAI gRPC on the follower side. Single-node usage is unchanged. Generalises the gallery resolution helper into findBackendPath() shared by MLX and vLLM workers; extracts ParseNodeLabels for the comma-separated label parsing both use. Ships with two compose recipes (`docker-compose.vllm-multinode.yaml` for NVIDIA, `docker-compose.vllm-multinode.intel.yaml` for Intel XPU/xccl) plus `tests/e2e/vllm-multinode/smoke.sh`. Both vendors are supported (NCCL for CUDA/ROCm, xccl for XPU) but mixed-vendor DP is not -- PyTorch's process group requires every rank to use the same collective backend, and NCCL/xccl/gloo don't interoperate. Out of scope (deferred): SmartRouter-driven placement of follower ranks via NATS backend.install events, follower log streaming through /api/backend-logs, tensor-parallel across nodes, disaggregated prefill via KVTransferConfig. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> test(vllm): CPU-only end-to-end test for multi-node DP Adds tests/e2e/vllm-multinode/, a Ginkgo + testcontainers-go suite that brings up a head + headless follower from the locally-built local-ai:tests image, bind-mounts the cpu-vllm backend extracted by make extract-backend-vllm so it's seen as a system backend (no gallery fetch, no registry server), and asserts a chat completion across both DP ranks. New `make test-e2e-vllm-multinode` target wires the docker build, backend extract, and ginkgo run together; BuildKit caches both images so re-runs only rebuild what changed. Tagged Label("VLLMMultinode") so the existing distributed suite isn't pulled along. Two pre-existing bugs surfaced by the test: 1. extract-backend-% (Makefile) failed for every backend, because all backend images end with `FROM scratch` and `docker create` rejects an image with no CMD/ENTRYPOINT. Fixed by passing --entrypoint=/run.sh -- the container is never started, only docker-cp'd, so the path doesn't have to exist; we just need anything that satisfies the daemon's create-time validation. 2. backend/python/vllm/run.sh's `serve` shortcut for the multi-node DP follower exec'd ${EDIR}/venv/bin/vllm directly, but uv bakes an absolute build-time shebang (`#!/vllm/venv/bin/python3`) that no longer resolves once the backend is relocated to BackendsPath. _makeVenvPortable's shebang rewriter only matches paths that already point at ${EDIR}, so the original shebang slips through unchanged. Fixed by exec-ing ${EDIR}/venv/bin/python with the script as an argument -- Python ignores the script's shebang in that case. The test fixture caps memory aggressively (max_model_len=512, VLLM_CPU_KVCACHE_SPACE=1, TORCH_COMPILE_DISABLE=1) so two CPU engines fit on a 32 GB box. TORCH_COMPILE_DISABLE is currently mandatory for cpu-vllm: torch._inductor's CPU-ISA probe runs even with enforce_eager=True and needs g++ on PATH, which the LocalAI runtime image doesn't ship -- to be addressed in a follow-up that bundles a toolchain in the cpu-vllm backend. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(vllm): bundle a g++ toolchain in the cpu-vllm backend image torch._inductor's CPU-ISA probe (`cpu_model_runner.py:65 "Warming up model for the compilation"`) shells out to `g++` at vllm engine startup, regardless of `enforce_eager=True` -- the eager flag only disables CUDA graphs, not inductor's first-batch warmup. The LocalAI CPU runtime image (Dockerfile, unconditional apt list) does not ship build-essential, and the cpu-vllm backend image is `FROM scratch`, so any non-trivial inference on cpu-vllm crashes with: torch._inductor.exc.InductorError: InvalidCxxCompiler: No working C++ compiler found in torch._inductor.config.cpp.cxx: (None, 'g++') Bundling the toolchain in the CPU runtime image would bloat every non-vllm-CPU deployment and force a single GCC version on backends that may want clang or a different version. So this lives in the backend, gated to BUILD_TYPE=='' (the CPU profile). `package.sh` snapshots g++ + binutils + cc1plus + libstdc++ + libc6 (runtime + dev) + the math libs cc1plus links (libisl/libmpc/libmpfr/ libjansson) into ${BACKEND}/toolchain/, mirroring /usr/... layout. The unversioned binaries on Debian/Ubuntu are symlink chains pointing into multiarch packages (`g++` -> `g++-13` -> `x86_64-linux-gnu-g++-13`, the latter in `g++-13-x86-64-linux-gnu`), so the package list resolves both the version and the arch-triplet variant. Symlinks /lib -> usr/lib and /lib64 -> usr/lib64 are recreated under the toolchain root because Ubuntu's UsrMerge keeps them at /, and ld scripts (`libc.so`, `libm.so`) hardcode `/lib/...` paths that --sysroot re-roots into the toolchain. The unversioned `g++`/`gcc`/`cpp` symlinks are replaced with wrapper shell scripts that resolve their own location at runtime and pass `--sysroot=<toolchain>` and `-B <toolchain>/usr/lib/gcc/<triplet>/<ver>/` to the underlying versioned binary. That's how torch's bare `g++ foo.cpp -o foo` invocation finds cc1plus (-B), system headers (--sysroot), and the bundled libstdc++ (--sysroot, --sysroot is recursive into linker). `run.sh` adds the toolchain bin dir to PATH and the toolchain's shared-lib dir to LD_LIBRARY_PATH -- everything else (header search, linker search, executable search) is encapsulated in the wrappers. No-op for non-CPU builds, the dir doesn't exist there. The cpu-vllm image grows by ~217 MB. Tradeoff is acceptable -- cpu-vllm is already a niche profile (few users compared to GPU vllm) and the alternative is a backend that crashes at first inference unless the operator manually sets TORCH_COMPILE_DISABLE=1, which silently disables all torch.compile optimizations. Drops `TORCH_COMPILE_DISABLE=1` from tests/e2e/vllm-multinode -- the smoke now exercises the real compile path through the bundled toolchain. Test runtime is +20s for the warmup compile, still <90s end to end. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(vllm): scope jetson-ai-lab index to L4T-specific wheels via pyproject.toml The L4T arm64 build resolves dependencies through pypi.jetson-ai-lab.io, which hosts the L4T-specific torch / vllm / flash-attn wheels but also transparently proxies the rest of PyPI through `/+f/<sha>/<filename>` URLs. With `--extra-index-url` + `--index-strategy=unsafe-best-match` uv would pick those proxy URLs for ordinary PyPI packages — anthropic/openai/propcache/annotated-types — and fail when the proxy 503s. Master is hitting the same bug on its own l4t-vllm matrix entry. Switch the l4t13 install path to a pyproject.toml that marks the jetson-ai-lab index `explicit = true` and pins only torch, torchvision, torchaudio, flash-attn, and vllm to it via [tool.uv.sources]. uv won't consult the L4T mirror for anything else, so transitive deps fall back to PyPI as the default index — no exposure to the proxy 503s. `uv pip install -r requirements.txt` ignores [tool.uv.sources], so the l4t13 branch in install.sh now invokes `uv pip install --requirement pyproject.toml` directly, replacing the old requirements-l4t13*.txt files. Other BUILD_PROFILEs continue using libbackend.sh's installRequirements and never read pyproject.toml. Local resolution test (x86_64, dry-run) confirms uv hits the L4T index for torch and falls through to PyPI for everything else. Assisted-by: claude-code:claude-opus-4-7-1m [Read] [Edit] [Bash] [Write] Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-05-06 00:22:50 +02:00
Ettore Di Giacinto	551ebdb57a	fix(distributed): correct VRAM/RAM reporting on NVIDIA unified-memory hosts (#9545 ) Workers on NVIDIA unified-memory hardware (DGX Spark / GB10, Jetson AGX Thor, Jetson Orin/Xavier/Nano) were reporting `available_vram=0` back to the frontend, so the Nodes UI showed the node as fully used even when most of the unified memory was actually free. Three causes addressed: * `isTegraDevice` only matched `/sys/devices/soc0/family == "Tegra"`. DGX Spark (SBSA) reports JEDEC codes there instead — `jep106:0426` for the NVIDIA manufacturer — so the Tegra/unified-memory fallback never ran. Renamed to `isNVIDIAIntegratedGPU` and extended to also match `jep106:0426[:]` via `/sys/devices/soc0/soc_id`. The unified-iGPU code defaulted the device name to `"NVIDIA Jetson"` when `/proc/device-tree/model` was missing. That's what happens for Thor inside a docker container, and always on DGX Spark. New `nvidiaIntegratedGPUName` resolves via dt-model → `/sys/devices/soc0/machine` → `soc_id` lookup (`jep106:0426:8901` → `"NVIDIA GB10"`) so the Nodes UI labels the box correctly. * Worker heartbeat sent `available_vram=0` (or total-as-available) when VRAM usage was momentarily unknown — e.g. when `nvidia-smi` intermittently failed with `waitid: no child processes` under containers without `--init`. Each such heartbeat overwrote the DB and made the UI flip to "fully used". `heartbeatBody` now omits `available_vram` in that case so the DB keeps its last good value. Also updates the commented GPU blocks in both compose files with `NVIDIA_DRIVER_CAPABILITIES=compute,utility`, `capabilities: [gpu, utility]`, and `init: true`, and documents the requirement in the distributed-mode and nvidia-l4t pages. Without `utility`, NVML/`nvidia-smi` are absent inside the container, which is what put the DGX Spark worker into the buggy fallback in the first place. Detection verified on live hardware (dgx.casa / GB10 and 192.168.68.23 / Thor) by running a cross-compiled probe of the new helpers on both host and inside the worker container. Assisted-by: Claude:opus-4.7 [Claude Code]	2026-04-24 22:02:23 +02:00
Ettore Di Giacinto	7e0b73deaa	fix(docs): fix broken references to distributed mode Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-03 09:46:06 +02:00
Ettore Di Giacinto	8862e3ce60	feat: add node reconciler, allow to schedule to group of nodes, min/max autoscaler (#9186 ) * always enable parallel requests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat: add node reconciler, allow to schedule to group of nodes, min/max autoscaler Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore: move tests to ginkgo Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(smart router): order by available vram Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-03-31 08:28:56 +02:00
Ettore Di Giacinto	59108fbe32	feat: add distributed mode (#9124 ) * feat: add distributed mode (experimental) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix data races, mutexes, transactions Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactorings Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix events and tool stream in agent chat Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * use ginkgo Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(cron): compute correctly time boundaries avoiding re-triggering Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * enhancements, refactorings Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * do not flood of healthy checks Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * do not list obvious backends as text backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * tests fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop redundant healthcheck Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * enhancements, refactorings Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-03-30 00:47:27 +02:00

8 Commits