mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-02 04:16:56 -04:00
master
3 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
5cda4f1ccf |
fix(L4T13 backends): switch vllm/sglang/vllm-omni to PyPI aarch64+cu130 wheels (#9950)
* fix(vllm): switch L4T13 backend to PyPI aarch64+cu130 wheels The L4T13 vllm backend pulled torch / torchvision / torchaudio / vllm from pypi.jetson-ai-lab.io's sbsa/cu130 mirror via [tool.uv.sources] with no version pins. That mirror started shipping torch 2.11.0 next to a vllm-0.20.0+cu130 wheel that was still compiled against torch 2.10's c10 ABI, so uv landed on the mismatched pair and vllm crashed at import: ImportError: vllm/_C.abi3.so: undefined symbol: _ZN3c1013MessageLoggerC1EPKciib (c10::MessageLogger's constructor signature changed between torch 2.10 and 2.11; the vllm wheel referenced the 2.10 form, the installed libc10.so exported only the 2.11 form.) Since torch 2.11 (April 2026) PyPI publishes its own aarch64 + cu130 manylinux wheels, and vllm 0.20.0 ships an aarch64 wheel whose Requires- Dist locks torch==2.11.0 / torchvision==0.26.0 / torchaudio==2.11.0. That makes uv's resolver produce an ABI-consistent set automatically, so the mirror and the [tool.uv.sources] pinning are no longer needed. flash-attn is dropped from the dep list: PyPI has no aarch64 wheel, but vLLM 0.20+ already bundles its own vllm_flash_attn (fa2 + fa3) inside the main wheel, so the Dao-AILab package isn't required at runtime. Reference: https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/ Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] [WebFetch] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(vllm): retire l4t13 pyproject.toml in favor of requirements-*.txt pyproject.toml only existed because uv pip install -r requirements.txt doesn't honor [tool.uv.sources]. The previous commit dropped [tool.uv. sources] (PyPI now serves the aarch64 + cu130 wheels directly), so the file no longer carries any logic the requirements-*.txt path can't. Replace with the same two-file pattern every other build profile uses: - requirements-l4t13.txt (accelerate / torch / transformers / bitsandbytes - matches cublas13's split) - requirements-l4t13-after.txt (vllm; runs after the base resolve so the cu130 torch wheel lands first) install.sh's whole l4t13 elif branch goes away; libbackend.sh's installRequirements already handles the requirements-install.txt build- deps pass, the C_INCLUDE_PATH export for PORTABLE_PYTHON, and the runProtogen call, so falling through to the standard else: branch produces identical install behavior with less surface area. No functional change at install time - same wheels, same order. Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(sglang,vllm-omni): switch L4T13 backends to PyPI aarch64+cu130 wheels Same root cause and same fix as the vllm backend in the previous commits: the L4T13 sglang and vllm-omni backends both pulled their accelerator stack from pypi.jetson-ai-lab.io's sbsa/cu130 mirror with no version pins, so they would silently land on the same torch 2.11 vs cu130-built wheel ABI mismatch the moment the mirror published an out-of-sync pair. sglang ------ - Drop pyproject.toml + [tool.uv.sources]. The historical comment said the [all] extra was unsafe on aarch64 because of decord, but sglang 0.5.x now uses `decord2` on aarch64/arm/armv7l (which ships cp312 aarch64 wheels), so we can match cublas13's sglang[all]>=0.5.11 pin and stop being capped at the 0.5.1.post2 the L4T mirror shipped. That unblocks Gemma 4 / MTP recipes on Jetson Thor. - New requirements-l4t13.txt mirrors the cublas13 split (accelerate / torch / torchvision / torchaudio / transformers), requirements-l4t13- after.txt carries sglang[all]>=0.5.11. - install.sh's l4t13 elif branch goes away; falls through to the standard installRequirements path. vllm-omni --------- - requirements-l4t13.txt drops --extra-index-url to jetson-ai-lab and drops flash-attn (PyPI has no aarch64 wheel, vLLM 0.20+ bundles its own vllm_flash_attn fa2 + fa3 internally). - install.sh's l4t13 vllm-install branch collapses into the cublas13 branch since both now just run `pip install vllm --torch-backend=auto` against PyPI. - --index-strategy=unsafe-best-match is dropped from the top-level l4t13 guard; without the L4T mirror in the picture it had no purpose. The from-source vllm-omni install on top still keeps its existing `sed -i '/^fa3-fwd[[:space:]]*==/d' requirements/cuda.txt` workaround - fa3-fwd has no aarch64 wheel and no sdist, unrelated to flash-attn. Reference: https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/ Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] [WebFetch] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(sglang): drop [all] extra on l4t13 - xatlas has no aarch64 wheel CI revealed that sglang[all]==0.5.12 transitively pulls xatlas via the [diffusion] sub-extra, and xatlas ships no aarch64 wheel. Its sdist depends on scikit_build_core without declaring it in build-system. requires, so under --no-build-isolation uv can't build it from source: × Failed to build `xatlas==0.0.11` ├─▶ The build backend returned an error ╰─▶ Call to `scikit_build_core.build.build_wheel` failed (exit status: 1) ModuleNotFoundError: No module named 'scikit_build_core' help: `xatlas` (v0.0.11) was included because `sglang[all]` (v0.5.12) depends on `xatlas` Upstream sglang explicitly gates st_attn and vsa on `platform_machine != aarch64` inside the same [diffusion] extra but forgot xatlas - same class of bug that bit the old decord pin. Use plain `sglang>=0.5.11` on l4t13. backend.py imports only base sglang.srt symbols (Engine, ServerArgs, FunctionCallParser, ReasoningParser); the [all] extras are optional accelerators not required at import time. cublas13 (x86_64) keeps [all] because xatlas has x86_64 wheels there. Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
8e43842175 |
feat(vllm, distributed): tensor parallel distributed workers (#9612)
* feat(vllm): build vllm from source for Intel XPU
Upstream publishes no XPU wheels for vllm. The Intel profile was
silently picking up a non-XPU wheel that imported but errored at
engine init, and several runtime deps (pillow, charset-normalizer,
chardet) were missing on Intel -- backend.py crashed at import time
before the gRPC server came up.
Switch the Intel profile to upstream's documented from-source
procedure (docs/getting_started/installation/gpu.xpu.inc.md in
vllm-project/vllm):
- Bump portable Python to 3.12 -- vllm-xpu-kernels ships only a
cp312 wheel.
- Source /opt/intel/oneapi/setvars.sh so vllm's CMake build sees
the dpcpp/sycl compiler from the oneapi-basekit base image.
- Hide requirements-intel-after.txt during installRequirements
(it used to 'pip install vllm'); install vllm's deps from a
fresh git clone of vllm via 'uv pip install -r
requirements/xpu.txt', swap stock triton for
triton-xpu==3.7.0, then 'VLLM_TARGET_DEVICE=xpu uv pip install
--no-deps .'.
- requirements-intel.txt trimmed to LocalAI's direct deps
(accelerate / transformers / bitsandbytes); torch-xpu, vllm,
vllm_xpu_kernels and the rest come from upstream's xpu.txt
during the source build.
- requirements.txt: add pillow + charset-normalizer + chardet --
used by backend.py and missing on the Intel install profile.
- run.sh: 'set -x' so backend startup is visible in container
logs (the gRPC startup error path was previously opaque).
Also adds a one-line docs example for engine_args.attention_backend
under the vLLM section, since older XE-HPG GPUs (e.g. Arc A770)
need TRITON_ATTN to bypass the cutlass path in vllm_xpu_kernels.
Tested end-to-end on an Intel Arc A770 with Qwen2.5-0.5B-Instruct
via LocalAI's /v1/chat/completions.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(vllm): add multi-node data-parallel follower worker
vLLM v1's multi-node story is one process per node sharing a DP
coordinator over ZMQ -- the head runs the API server with
data_parallel_size > 1 and followers run `vllm serve --headless ...`
with matching topology. Today LocalAI can already configure DP on the
head via the engine_args YAML map, but there's no way to bring up the
follower nodes -- so the head sits waiting for ranks that never
handshake.
Add `local-ai p2p-worker vllm`, mirroring MLXDistributed's structural
precedent (operator-launched, static config, no NATS placement). The
worker:
- Optionally self-registers with the frontend as an agent-type node
tagged `node.role=vllm-follower` so it's visible in the admin UI
and operators can scope ordinary models away via inverse
selectors.
- Resolves the platform-specific vllm backend via the gallery's
"vllm" meta-entry (cuda*, intel-vllm, rocm-vllm, ...).
- Runs vLLM as a child process so the heartbeat goroutine survives
until vLLM exits; forwards SIGINT/SIGTERM so vLLM can clean up its
ZMQ sockets before we tear down.
- Validates --headless + --start-rank 0 is rejected (rank 0 is the
head and must serve the API).
Backend run.sh dispatches `serve` as the first arg to vllm's own CLI
instead of LocalAI's backend.py gRPC server -- the follower speaks
ZMQ directly to the head, there is no LocalAI gRPC on the follower
side. Single-node usage is unchanged.
Generalises the gallery resolution helper into findBackendPath()
shared by MLX and vLLM workers; extracts ParseNodeLabels for the
comma-separated label parsing both use.
Ships with two compose recipes (`docker-compose.vllm-multinode.yaml`
for NVIDIA, `docker-compose.vllm-multinode.intel.yaml` for Intel
XPU/xccl) plus `tests/e2e/vllm-multinode/smoke.sh`. Both vendors are
supported (NCCL for CUDA/ROCm, xccl for XPU) but mixed-vendor DP is
not -- PyTorch's process group requires every rank to use the same
collective backend, and NCCL/xccl/gloo don't interoperate.
Out of scope (deferred): SmartRouter-driven placement of follower
ranks via NATS backend.install events, follower log streaming through
/api/backend-logs, tensor-parallel across nodes, disaggregated
prefill via KVTransferConfig.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* test(vllm): CPU-only end-to-end test for multi-node DP
Adds tests/e2e/vllm-multinode/, a Ginkgo + testcontainers-go suite
that brings up a head + headless follower from the locally-built
local-ai:tests image, bind-mounts the cpu-vllm backend extracted by
make extract-backend-vllm so it's seen as a system backend (no gallery
fetch, no registry server), and asserts a chat completion across both
DP ranks. New `make test-e2e-vllm-multinode` target wires the docker
build, backend extract, and ginkgo run together; BuildKit caches both
images so re-runs only rebuild what changed. Tagged Label("VLLMMultinode")
so the existing distributed suite isn't pulled along.
Two pre-existing bugs surfaced by the test:
1. extract-backend-% (Makefile) failed for every backend, because all
backend images end with `FROM scratch` and `docker create` rejects
an image with no CMD/ENTRYPOINT. Fixed by passing
--entrypoint=/run.sh -- the container is never started, only
docker-cp'd, so the path doesn't have to exist; we just need
anything that satisfies the daemon's create-time validation.
2. backend/python/vllm/run.sh's `serve` shortcut for the multi-node DP
follower exec'd ${EDIR}/venv/bin/vllm directly, but uv bakes an
absolute build-time shebang (`#!/vllm/venv/bin/python3`) that no
longer resolves once the backend is relocated to BackendsPath.
_makeVenvPortable's shebang rewriter only matches paths that
already point at ${EDIR}, so the original shebang slips through
unchanged. Fixed by exec-ing ${EDIR}/venv/bin/python with the script
as an argument -- Python ignores the script's shebang in that case.
The test fixture caps memory aggressively (max_model_len=512,
VLLM_CPU_KVCACHE_SPACE=1, TORCH_COMPILE_DISABLE=1) so two CPU engines
fit on a 32 GB box. TORCH_COMPILE_DISABLE is currently mandatory for
cpu-vllm: torch._inductor's CPU-ISA probe runs even with
enforce_eager=True and needs g++ on PATH, which the LocalAI runtime
image doesn't ship -- to be addressed in a follow-up that bundles a
toolchain in the cpu-vllm backend.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(vllm): bundle a g++ toolchain in the cpu-vllm backend image
torch._inductor's CPU-ISA probe (`cpu_model_runner.py:65 "Warming up
model for the compilation"`) shells out to `g++` at vllm engine
startup, regardless of `enforce_eager=True` -- the eager flag only
disables CUDA graphs, not inductor's first-batch warmup. The LocalAI
CPU runtime image (Dockerfile, unconditional apt list) does not ship
build-essential, and the cpu-vllm backend image is `FROM scratch`,
so any non-trivial inference on cpu-vllm crashes with:
torch._inductor.exc.InductorError:
InvalidCxxCompiler: No working C++ compiler found in
torch._inductor.config.cpp.cxx: (None, 'g++')
Bundling the toolchain in the CPU runtime image would bloat every
non-vllm-CPU deployment and force a single GCC version on backends
that may want clang or a different version. So this lives in the
backend, gated to BUILD_TYPE=='' (the CPU profile).
`package.sh` snapshots g++ + binutils + cc1plus + libstdc++ + libc6
(runtime + dev) + the math libs cc1plus links (libisl/libmpc/libmpfr/
libjansson) into ${BACKEND}/toolchain/, mirroring /usr/... layout. The
unversioned binaries on Debian/Ubuntu are symlink chains pointing into
multiarch packages (`g++` -> `g++-13` -> `x86_64-linux-gnu-g++-13`,
the latter in `g++-13-x86-64-linux-gnu`), so the package list resolves
both the version and the arch-triplet variant. Symlinks /lib ->
usr/lib and /lib64 -> usr/lib64 are recreated under the toolchain
root because Ubuntu's UsrMerge keeps them at /, and ld scripts
(`libc.so`, `libm.so`) hardcode `/lib/...` paths that --sysroot
re-roots into the toolchain.
The unversioned `g++`/`gcc`/`cpp` symlinks are replaced with wrapper
shell scripts that resolve their own location at runtime and pass
`--sysroot=<toolchain>` and `-B <toolchain>/usr/lib/gcc/<triplet>/<ver>/`
to the underlying versioned binary. That's how torch's bare `g++ foo.cpp
-o foo` invocation finds cc1plus (-B), system headers (--sysroot), and
the bundled libstdc++ (--sysroot, --sysroot is recursive into linker).
`run.sh` adds the toolchain bin dir to PATH and the toolchain's
shared-lib dir to LD_LIBRARY_PATH -- everything else (header search,
linker search, executable search) is encapsulated in the wrappers.
No-op for non-CPU builds, the dir doesn't exist there.
The cpu-vllm image grows by ~217 MB. Tradeoff is acceptable -- cpu-vllm
is already a niche profile (few users compared to GPU vllm) and the
alternative is a backend that crashes at first inference unless the
operator manually sets TORCH_COMPILE_DISABLE=1, which silently disables
all torch.compile optimizations.
Drops `TORCH_COMPILE_DISABLE=1` from tests/e2e/vllm-multinode -- the
smoke now exercises the real compile path through the bundled toolchain.
Test runtime is +20s for the warmup compile, still <90s end to end.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* fix(vllm): scope jetson-ai-lab index to L4T-specific wheels via pyproject.toml
The L4T arm64 build resolves dependencies through pypi.jetson-ai-lab.io,
which hosts the L4T-specific torch / vllm / flash-attn wheels but also
transparently proxies the rest of PyPI through `/+f/<sha>/<filename>`
URLs. With `--extra-index-url` + `--index-strategy=unsafe-best-match`
uv would pick those proxy URLs for ordinary PyPI packages —
anthropic/openai/propcache/annotated-types — and fail when the proxy
503s. Master is hitting the same bug on its own l4t-vllm matrix entry.
Switch the l4t13 install path to a pyproject.toml that marks the
jetson-ai-lab index `explicit = true` and pins only torch, torchvision,
torchaudio, flash-attn, and vllm to it via [tool.uv.sources]. uv won't
consult the L4T mirror for anything else, so transitive deps fall back
to PyPI as the default index — no exposure to the proxy 503s.
`uv pip install -r requirements.txt` ignores [tool.uv.sources], so the
l4t13 branch in install.sh now invokes `uv pip install --requirement
pyproject.toml` directly, replacing the old requirements-l4t13*.txt
files. Other BUILD_PROFILEs continue using libbackend.sh's
installRequirements and never read pyproject.toml.
Local resolution test (x86_64, dry-run) confirms uv hits the L4T
index for torch and falls through to PyPI for everything else.
Assisted-by: claude-code:claude-opus-4-7-1m [Read] [Edit] [Bash] [Write]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
|
||
|
|
24505e57f5 |
feat(backends): add CUDA 13 + L4T arm64 CUDA 13 variants for vllm/vllm-omni/sglang (#9553)
* feat(backends): add CUDA 13 + L4T arm64 CUDA 13 variants for vllm/vllm-omni/sglang
Adds new build profiles mirroring the diffusers/ace-step pattern so vLLM
serving (and SGLang on arm64) can be deployed on CUDA 13 hosts and
JetPack 7 boards:
- vllm: cublas13 (PyPI cu130 channel) + l4t13 (jetson-ai-lab SBSA cu130
prebuilt vllm + flash-attn).
- vllm-omni: cublas13 + l4t13. Floats vllm version on cu13 since vllm
0.19+ ships cu130 wheels by default and vllm-omni tracks vllm master;
cu12 path keeps the 0.14.0 pin to avoid disturbing existing images.
- sglang: l4t13 arm64 only — uses the prebuilt sglang wheel from the
jetson-ai-lab SBSA cu130 index, so no source build is needed.
Cublas13 sglang on x86_64 is intentionally deferred.
CI matrix gains five new images (-gpu-nvidia-cuda-13-vllm{,-omni},
-nvidia-l4t-cuda-13-arm64-{vllm,vllm-omni,sglang}); backend/index.yaml
gains the matching capability keys (nvidia-cuda-13, nvidia-l4t-cuda-13)
and latest/development merge entries.
Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
* fix(backends): use unsafe-best-match index strategy on l4t13 builds
The jetson-ai-lab SBSA cu130 index lists transitive deps (decord, etc.)
at limited versions / older Python ABIs. uv defaults to the first index
that contains a package and refuses to fall through to PyPI, so sglang
l4t13 build fails resolving decord. Mirror the existing cpu sglang
profile by setting --index-strategy=unsafe-best-match on l4t13 across
the three backends, and apply it to the explicit vllm install line in
vllm-omni's install.sh (which doesn't honor EXTRA_PIP_INSTALL_FLAGS).
Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash]
* fix(sglang): drop [all] extras on l4t13, floor version at 0.5.0
The [all] extra brings in outlines→decord, and decord has no aarch64
cp312 wheel on PyPI nor the jetson-ai-lab index (only legacy cp35-cp37
tags). With unsafe-best-match enabled, uv backtracked through sglang
versions trying to satisfy decord and silently landed on
sglang==0.1.16, an ancient version with an entirely different dep
tree (cloudpickle/outlines 0.0.44, etc.).
Drop [all] so decord is no longer required, and floor sglang at 0.5.0
to prevent any future resolver misfire from degrading the version
again.
Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|