LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-05-17 13:10:23 -04:00

Author	SHA1	Message	Date
Richard Palethorpe	8e43842175	feat(vllm, distributed): tensor parallel distributed workers (#9612 ) * feat(vllm): build vllm from source for Intel XPU Upstream publishes no XPU wheels for vllm. The Intel profile was silently picking up a non-XPU wheel that imported but errored at engine init, and several runtime deps (pillow, charset-normalizer, chardet) were missing on Intel -- backend.py crashed at import time before the gRPC server came up. Switch the Intel profile to upstream's documented from-source procedure (docs/getting_started/installation/gpu.xpu.inc.md in vllm-project/vllm): - Bump portable Python to 3.12 -- vllm-xpu-kernels ships only a cp312 wheel. - Source /opt/intel/oneapi/setvars.sh so vllm's CMake build sees the dpcpp/sycl compiler from the oneapi-basekit base image. - Hide requirements-intel-after.txt during installRequirements (it used to 'pip install vllm'); install vllm's deps from a fresh git clone of vllm via 'uv pip install -r requirements/xpu.txt', swap stock triton for triton-xpu==3.7.0, then 'VLLM_TARGET_DEVICE=xpu uv pip install --no-deps .'. - requirements-intel.txt trimmed to LocalAI's direct deps (accelerate / transformers / bitsandbytes); torch-xpu, vllm, vllm_xpu_kernels and the rest come from upstream's xpu.txt during the source build. - requirements.txt: add pillow + charset-normalizer + chardet -- used by backend.py and missing on the Intel install profile. - run.sh: 'set -x' so backend startup is visible in container logs (the gRPC startup error path was previously opaque). Also adds a one-line docs example for engine_args.attention_backend under the vLLM section, since older XE-HPG GPUs (e.g. Arc A770) need TRITON_ATTN to bypass the cutlass path in vllm_xpu_kernels. Tested end-to-end on an Intel Arc A770 with Qwen2.5-0.5B-Instruct via LocalAI's /v1/chat/completions. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(vllm): add multi-node data-parallel follower worker vLLM v1's multi-node story is one process per node sharing a DP coordinator over ZMQ -- the head runs the API server with data_parallel_size > 1 and followers run `vllm serve --headless ...` with matching topology. Today LocalAI can already configure DP on the head via the engine_args YAML map, but there's no way to bring up the follower nodes -- so the head sits waiting for ranks that never handshake. Add `local-ai p2p-worker vllm`, mirroring MLXDistributed's structural precedent (operator-launched, static config, no NATS placement). The worker: - Optionally self-registers with the frontend as an agent-type node tagged `node.role=vllm-follower` so it's visible in the admin UI and operators can scope ordinary models away via inverse selectors. - Resolves the platform-specific vllm backend via the gallery's "vllm" meta-entry (cuda, intel-vllm, rocm-vllm, ...). - Runs vLLM as a child process so the heartbeat goroutine survives until vLLM exits; forwards SIGINT/SIGTERM so vLLM can clean up its ZMQ sockets before we tear down. - Validates --headless + --start-rank 0 is rejected (rank 0 is the head and must serve the API). Backend run.sh dispatches `serve` as the first arg to vllm's own CLI instead of LocalAI's backend.py gRPC server -- the follower speaks ZMQ directly to the head, there is no LocalAI gRPC on the follower side. Single-node usage is unchanged. Generalises the gallery resolution helper into findBackendPath() shared by MLX and vLLM workers; extracts ParseNodeLabels for the comma-separated label parsing both use. Ships with two compose recipes (`docker-compose.vllm-multinode.yaml` for NVIDIA, `docker-compose.vllm-multinode.intel.yaml` for Intel XPU/xccl) plus `tests/e2e/vllm-multinode/smoke.sh`. Both vendors are supported (NCCL for CUDA/ROCm, xccl for XPU) but mixed-vendor DP is not -- PyTorch's process group requires every rank to use the same collective backend, and NCCL/xccl/gloo don't interoperate. Out of scope (deferred): SmartRouter-driven placement of follower ranks via NATS backend.install events, follower log streaming through /api/backend-logs, tensor-parallel across nodes, disaggregated prefill via KVTransferConfig. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> test(vllm): CPU-only end-to-end test for multi-node DP Adds tests/e2e/vllm-multinode/, a Ginkgo + testcontainers-go suite that brings up a head + headless follower from the locally-built local-ai:tests image, bind-mounts the cpu-vllm backend extracted by make extract-backend-vllm so it's seen as a system backend (no gallery fetch, no registry server), and asserts a chat completion across both DP ranks. New `make test-e2e-vllm-multinode` target wires the docker build, backend extract, and ginkgo run together; BuildKit caches both images so re-runs only rebuild what changed. Tagged Label("VLLMMultinode") so the existing distributed suite isn't pulled along. Two pre-existing bugs surfaced by the test: 1. extract-backend-% (Makefile) failed for every backend, because all backend images end with `FROM scratch` and `docker create` rejects an image with no CMD/ENTRYPOINT. Fixed by passing --entrypoint=/run.sh -- the container is never started, only docker-cp'd, so the path doesn't have to exist; we just need anything that satisfies the daemon's create-time validation. 2. backend/python/vllm/run.sh's `serve` shortcut for the multi-node DP follower exec'd ${EDIR}/venv/bin/vllm directly, but uv bakes an absolute build-time shebang (`#!/vllm/venv/bin/python3`) that no longer resolves once the backend is relocated to BackendsPath. _makeVenvPortable's shebang rewriter only matches paths that already point at ${EDIR}, so the original shebang slips through unchanged. Fixed by exec-ing ${EDIR}/venv/bin/python with the script as an argument -- Python ignores the script's shebang in that case. The test fixture caps memory aggressively (max_model_len=512, VLLM_CPU_KVCACHE_SPACE=1, TORCH_COMPILE_DISABLE=1) so two CPU engines fit on a 32 GB box. TORCH_COMPILE_DISABLE is currently mandatory for cpu-vllm: torch._inductor's CPU-ISA probe runs even with enforce_eager=True and needs g++ on PATH, which the LocalAI runtime image doesn't ship -- to be addressed in a follow-up that bundles a toolchain in the cpu-vllm backend. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(vllm): bundle a g++ toolchain in the cpu-vllm backend image torch._inductor's CPU-ISA probe (`cpu_model_runner.py:65 "Warming up model for the compilation"`) shells out to `g++` at vllm engine startup, regardless of `enforce_eager=True` -- the eager flag only disables CUDA graphs, not inductor's first-batch warmup. The LocalAI CPU runtime image (Dockerfile, unconditional apt list) does not ship build-essential, and the cpu-vllm backend image is `FROM scratch`, so any non-trivial inference on cpu-vllm crashes with: torch._inductor.exc.InductorError: InvalidCxxCompiler: No working C++ compiler found in torch._inductor.config.cpp.cxx: (None, 'g++') Bundling the toolchain in the CPU runtime image would bloat every non-vllm-CPU deployment and force a single GCC version on backends that may want clang or a different version. So this lives in the backend, gated to BUILD_TYPE=='' (the CPU profile). `package.sh` snapshots g++ + binutils + cc1plus + libstdc++ + libc6 (runtime + dev) + the math libs cc1plus links (libisl/libmpc/libmpfr/ libjansson) into ${BACKEND}/toolchain/, mirroring /usr/... layout. The unversioned binaries on Debian/Ubuntu are symlink chains pointing into multiarch packages (`g++` -> `g++-13` -> `x86_64-linux-gnu-g++-13`, the latter in `g++-13-x86-64-linux-gnu`), so the package list resolves both the version and the arch-triplet variant. Symlinks /lib -> usr/lib and /lib64 -> usr/lib64 are recreated under the toolchain root because Ubuntu's UsrMerge keeps them at /, and ld scripts (`libc.so`, `libm.so`) hardcode `/lib/...` paths that --sysroot re-roots into the toolchain. The unversioned `g++`/`gcc`/`cpp` symlinks are replaced with wrapper shell scripts that resolve their own location at runtime and pass `--sysroot=<toolchain>` and `-B <toolchain>/usr/lib/gcc/<triplet>/<ver>/` to the underlying versioned binary. That's how torch's bare `g++ foo.cpp -o foo` invocation finds cc1plus (-B), system headers (--sysroot), and the bundled libstdc++ (--sysroot, --sysroot is recursive into linker). `run.sh` adds the toolchain bin dir to PATH and the toolchain's shared-lib dir to LD_LIBRARY_PATH -- everything else (header search, linker search, executable search) is encapsulated in the wrappers. No-op for non-CPU builds, the dir doesn't exist there. The cpu-vllm image grows by ~217 MB. Tradeoff is acceptable -- cpu-vllm is already a niche profile (few users compared to GPU vllm) and the alternative is a backend that crashes at first inference unless the operator manually sets TORCH_COMPILE_DISABLE=1, which silently disables all torch.compile optimizations. Drops `TORCH_COMPILE_DISABLE=1` from tests/e2e/vllm-multinode -- the smoke now exercises the real compile path through the bundled toolchain. Test runtime is +20s for the warmup compile, still <90s end to end. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(vllm): scope jetson-ai-lab index to L4T-specific wheels via pyproject.toml The L4T arm64 build resolves dependencies through pypi.jetson-ai-lab.io, which hosts the L4T-specific torch / vllm / flash-attn wheels but also transparently proxies the rest of PyPI through `/+f/<sha>/<filename>` URLs. With `--extra-index-url` + `--index-strategy=unsafe-best-match` uv would pick those proxy URLs for ordinary PyPI packages — anthropic/openai/propcache/annotated-types — and fail when the proxy 503s. Master is hitting the same bug on its own l4t-vllm matrix entry. Switch the l4t13 install path to a pyproject.toml that marks the jetson-ai-lab index `explicit = true` and pins only torch, torchvision, torchaudio, flash-attn, and vllm to it via [tool.uv.sources]. uv won't consult the L4T mirror for anything else, so transitive deps fall back to PyPI as the default index — no exposure to the proxy 503s. `uv pip install -r requirements.txt` ignores [tool.uv.sources], so the l4t13 branch in install.sh now invokes `uv pip install --requirement pyproject.toml` directly, replacing the old requirements-l4t13*.txt files. Other BUILD_PROFILEs continue using libbackend.sh's installRequirements and never read pyproject.toml. Local resolution test (x86_64, dry-run) confirms uv hits the L4T index for torch and falls through to PyPI for everything else. Assisted-by: claude-code:claude-opus-4-7-1m [Read] [Edit] [Bash] [Write] Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-05-06 00:22:50 +02:00
Ettore Di Giacinto	551ebdb57a	fix(distributed): correct VRAM/RAM reporting on NVIDIA unified-memory hosts (#9545 ) Workers on NVIDIA unified-memory hardware (DGX Spark / GB10, Jetson AGX Thor, Jetson Orin/Xavier/Nano) were reporting `available_vram=0` back to the frontend, so the Nodes UI showed the node as fully used even when most of the unified memory was actually free. Three causes addressed: * `isTegraDevice` only matched `/sys/devices/soc0/family == "Tegra"`. DGX Spark (SBSA) reports JEDEC codes there instead — `jep106:0426` for the NVIDIA manufacturer — so the Tegra/unified-memory fallback never ran. Renamed to `isNVIDIAIntegratedGPU` and extended to also match `jep106:0426[:]` via `/sys/devices/soc0/soc_id`. The unified-iGPU code defaulted the device name to `"NVIDIA Jetson"` when `/proc/device-tree/model` was missing. That's what happens for Thor inside a docker container, and always on DGX Spark. New `nvidiaIntegratedGPUName` resolves via dt-model → `/sys/devices/soc0/machine` → `soc_id` lookup (`jep106:0426:8901` → `"NVIDIA GB10"`) so the Nodes UI labels the box correctly. * Worker heartbeat sent `available_vram=0` (or total-as-available) when VRAM usage was momentarily unknown — e.g. when `nvidia-smi` intermittently failed with `waitid: no child processes` under containers without `--init`. Each such heartbeat overwrote the DB and made the UI flip to "fully used". `heartbeatBody` now omits `available_vram` in that case so the DB keeps its last good value. Also updates the commented GPU blocks in both compose files with `NVIDIA_DRIVER_CAPABILITIES=compute,utility`, `capabilities: [gpu, utility]`, and `init: true`, and documents the requirement in the distributed-mode and nvidia-l4t pages. Without `utility`, NVML/`nvidia-smi` are absent inside the container, which is what put the DGX Spark worker into the buggy fallback in the first place. Detection verified on live hardware (dgx.casa / GB10 and 192.168.68.23 / Thor) by running a cross-compiled probe of the new helpers on both host and inside the worker container. Assisted-by: Claude:opus-4.7 [Claude Code]	2026-04-24 22:02:23 +02:00
Ettore Di Giacinto	7e0b73deaa	fix(docs): fix broken references to distributed mode Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-03 09:46:06 +02:00
Ettore Di Giacinto	8862e3ce60	feat: add node reconciler, allow to schedule to group of nodes, min/max autoscaler (#9186 ) * always enable parallel requests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat: add node reconciler, allow to schedule to group of nodes, min/max autoscaler Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore: move tests to ginkgo Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(smart router): order by available vram Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-03-31 08:28:56 +02:00
Ettore Di Giacinto	59108fbe32	feat: add distributed mode (#9124 ) * feat: add distributed mode (experimental) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix data races, mutexes, transactions Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactorings Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix events and tool stream in agent chat Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * use ginkgo Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(cron): compute correctly time boundaries avoiding re-triggering Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * enhancements, refactorings Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * do not flood of healthy checks Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * do not list obvious backends as text backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * tests fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop redundant healthcheck Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * enhancements, refactorings Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-03-30 00:47:27 +02:00

5 Commits