feat(vllm, distributed): tensor parallel distributed workers (#9612)

* feat(vllm): build vllm from source for Intel XPU

Upstream publishes no XPU wheels for vllm. The Intel profile was
silently picking up a non-XPU wheel that imported but errored at
engine init, and several runtime deps (pillow, charset-normalizer,
chardet) were missing on Intel -- backend.py crashed at import time
before the gRPC server came up.

Switch the Intel profile to upstream's documented from-source
procedure (docs/getting_started/installation/gpu.xpu.inc.md in
vllm-project/vllm):

  - Bump portable Python to 3.12 -- vllm-xpu-kernels ships only a
    cp312 wheel.
  - Source /opt/intel/oneapi/setvars.sh so vllm's CMake build sees
    the dpcpp/sycl compiler from the oneapi-basekit base image.
  - Hide requirements-intel-after.txt during installRequirements
    (it used to 'pip install vllm'); install vllm's deps from a
    fresh git clone of vllm via 'uv pip install -r
    requirements/xpu.txt', swap stock triton for
    triton-xpu==3.7.0, then 'VLLM_TARGET_DEVICE=xpu uv pip install
    --no-deps .'.
  - requirements-intel.txt trimmed to LocalAI's direct deps
    (accelerate / transformers / bitsandbytes); torch-xpu, vllm,
    vllm_xpu_kernels and the rest come from upstream's xpu.txt
    during the source build.
  - requirements.txt: add pillow + charset-normalizer + chardet --
    used by backend.py and missing on the Intel install profile.
  - run.sh: 'set -x' so backend startup is visible in container
    logs (the gRPC startup error path was previously opaque).

Also adds a one-line docs example for engine_args.attention_backend
under the vLLM section, since older Xe-HPG GPUs (e.g. Arc A770)
need TRITON_ATTN to bypass the cutlass path in vllm_xpu_kernels.
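
In a model config that is roughly (key names as described above; the
surrounding model YAML omitted):

    engine_args:
      attention_backend: TRITON_ATTN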

Tested end-to-end on an Intel Arc A770 with Qwen2.5-0.5B-Instruct
via LocalAI's /v1/chat/completions.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(vllm): add multi-node data-parallel follower worker

vLLM v1's multi-node story is one process per node sharing a DP
coordinator over ZMQ -- the head runs the API server with
data_parallel_size > 1 and followers run `vllm serve --headless ...`
with matching topology. Today LocalAI can already configure DP on the
head via the engine_args YAML map, but there's no way to bring up the
follower nodes -- so the head sits waiting for ranks that never
handshake.
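
For reference, the raw vLLM topology this maps onto looks roughly like
the following (flag names per vLLM's multi-node DP docs; model,
address, and sizes illustrative):

    # head: serves the API and owns DP rank 0
    vllm serve $MODEL --data-parallel-size 2 --data-parallel-size-local 1 \
        --data-parallel-address $HEAD_IP --data-parallel-rpc-port 13345

    # follower: no API server, contributes the remaining rank(s)
    vllm serve $MODEL --headless --data-parallel-size 2 \
        --data-parallel-size-local 1 --data-parallel-start-rank 1 \
        --data-parallel-address $HEAD_IP --data-parallel-rpc-port 13345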

Add `local-ai p2p-worker vllm` (an invocation is sketched after the
list below), mirroring MLXDistributed's structural precedent
(operator-launched, static config, no NATS placement). The worker:

  - Optionally self-registers with the frontend as an agent-type node
    tagged `node.role=vllm-follower` so it's visible in the admin UI
    and operators can scope ordinary models away via inverse
    selectors.
  - Resolves the platform-specific vllm backend via the gallery's
    "vllm" meta-entry (cuda*, intel-vllm, rocm-vllm, ...).
  - Runs vLLM as a child process so the heartbeat goroutine survives
    until vLLM exits; forwards SIGINT/SIGTERM so vLLM can clean up its
    ZMQ sockets before we tear down.
  - Rejects --headless combined with --start-rank 0 (rank 0 is the
    head and must serve the API).
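
A follower launch then looks something like this (only the `vllm`
subcommand and the --headless / --start-rank flags are spellings taken
from above; the remainder is illustrative):

    local-ai p2p-worker vllm --headless --start-rank 1 ...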

When its first argument is `serve`, the backend's run.sh dispatches to
vllm's own CLI instead of LocalAI's backend.py gRPC server -- the
follower speaks ZMQ directly to the head; there is no LocalAI gRPC on
the follower side. Single-node usage is unchanged.

Generalises the gallery resolution helper into findBackendPath()
shared by MLX and vLLM workers; extracts ParseNodeLabels for the
comma-separated label parsing both use.

Ships with two compose recipes (`docker-compose.vllm-multinode.yaml`
for NVIDIA, `docker-compose.vllm-multinode.intel.yaml` for Intel
XPU/xccl) plus `tests/e2e/vllm-multinode/smoke.sh`. Both vendors are
supported (NCCL for CUDA/ROCm, xccl for XPU) but mixed-vendor DP is
not -- PyTorch's process group requires every rank to use the same
collective backend, and NCCL/xccl/gloo don't interoperate.
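
A minimal NVIDIA bring-up with the shipped recipe would look something
like this (invocation illustrative; see the recipe itself for the real
topology):

    docker compose -f docker-compose.vllm-multinode.yaml up -d
    ./tests/e2e/vllm-multinode/smoke.sh   # chat-completion smoke test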

Out of scope (deferred): SmartRouter-driven placement of follower
ranks via NATS backend.install events, follower log streaming through
/api/backend-logs, tensor-parallel across nodes, disaggregated
prefill via KVTransferConfig.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* test(vllm): CPU-only end-to-end test for multi-node DP

Adds tests/e2e/vllm-multinode/, a Ginkgo + testcontainers-go suite
that brings up a head + headless follower from the locally-built
local-ai:tests image, bind-mounts the cpu-vllm backend extracted by
make extract-backend-vllm so it's seen as a system backend (no gallery
fetch, no registry server), and asserts a chat completion across both
DP ranks. A new `make test-e2e-vllm-multinode` target wires the docker
build, backend extract, and ginkgo run together; BuildKit caches both
images so re-runs only rebuild what changed. The suite is tagged
Label("VLLMMultinode") so the existing distributed suite isn't pulled
along.
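
For iterating on just this suite (the make target remains the wired-up
entry point; the direct ginkgo invocation assumes the images and the
extracted backend already exist):

    make test-e2e-vllm-multinode
    # or, after the first full run:
    ginkgo --label-filter=VLLMMultinode ./tests/e2e/vllm-multinode/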

Two pre-existing bugs surfaced by the test:

1. extract-backend-% (Makefile) failed for every backend, because all
   backend images end with `FROM scratch` and `docker create` rejects
   an image with no CMD/ENTRYPOINT. Fixed by passing
   --entrypoint=/run.sh -- the container is never started, only
   docker-cp'd, so the path doesn't have to exist; we just need
   anything that satisfies the daemon's create-time validation.
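
The fixed rule boils down to this sequence (image and destination names
illustrative):

    cid=$(docker create --entrypoint=/run.sh "${BACKEND_IMAGE}")
    docker cp "${cid}:/." backend-images/cpu-vllm/
    docker rm -f "${cid}"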

2. backend/python/vllm/run.sh's `serve` shortcut for the multi-node DP
   follower exec'd ${EDIR}/venv/bin/vllm directly, but uv bakes an
   absolute build-time shebang (`#!/vllm/venv/bin/python3`) that no
   longer resolves once the backend is relocated to BackendsPath.
   _makeVenvPortable's shebang rewriter only matches paths that
   already point at ${EDIR}, so the original shebang slips through
   unchanged. Fixed by exec-ing ${EDIR}/venv/bin/python with the script
   as an argument -- Python ignores the script's shebang in that case.
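
Concretely (paths illustrative):

    head -1 "${EDIR}/venv/bin/vllm"
    # -> #!/vllm/venv/bin/python3  (build-time path, stale after relocation)
    "${EDIR}/venv/bin/python" "${EDIR}/venv/bin/vllm" serve ...
    # works: python gets the script as an argument, the shebang is never read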

The test fixture caps memory aggressively (max_model_len=512,
VLLM_CPU_KVCACHE_SPACE=1, TORCH_COMPILE_DISABLE=1) so two CPU engines
fit on a 32 GB box. TORCH_COMPILE_DISABLE is currently mandatory for
cpu-vllm: torch._inductor's CPU-ISA probe runs even with
enforce_eager=True and needs g++ on PATH, which the LocalAI runtime
image doesn't ship -- to be addressed in a follow-up that bundles a
toolchain in the cpu-vllm backend.
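
The caps land roughly as follows (placement illustrative; the exact
wiring lives in the suite):

    export VLLM_CPU_KVCACHE_SPACE=1   # GiB of KV cache per CPU engine
    export TORCH_COMPILE_DISABLE=1    # dropped again by the toolchain commit below
    # plus max_model_len: 512 in the model's engine_args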

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(vllm): bundle a g++ toolchain in the cpu-vllm backend image

torch._inductor's CPU-ISA probe (`cpu_model_runner.py:65 "Warming up
model for the compilation"`) shells out to `g++` at vllm engine
startup, regardless of `enforce_eager=True` -- the eager flag only
disables CUDA graphs, not inductor's first-batch warmup. The LocalAI
CPU runtime image (Dockerfile, unconditional apt list) does not ship
build-essential, and the cpu-vllm backend image is `FROM scratch`,
so any non-trivial inference on cpu-vllm crashes with:

  torch._inductor.exc.InductorError:
    InvalidCxxCompiler: No working C++ compiler found in
    torch._inductor.config.cpp.cxx: (None, 'g++')

Bundling the toolchain in the CPU runtime image would bloat every
non-vllm-CPU deployment and force a single GCC version on backends
that may want clang or a different version. So this lives in the
backend, gated to BUILD_TYPE=='' (the CPU profile).

`package.sh` snapshots g++ + binutils + cc1plus + libstdc++ + libc6
(runtime + dev) + the math libs cc1plus links (libisl/libmpc/libmpfr/
libjansson) into ${BACKEND}/toolchain/, mirroring /usr/... layout. The
unversioned binaries on Debian/Ubuntu are symlink chains pointing into
multiarch packages (`g++` -> `g++-13` -> `x86_64-linux-gnu-g++-13`,
the latter in `g++-13-x86-64-linux-gnu`), so the package list resolves
both the version and the arch-triplet variant. Symlinks /lib ->
usr/lib and /lib64 -> usr/lib64 are recreated under the toolchain
root because Ubuntu's UsrMerge keeps them at /, and ld scripts
(`libc.so`, `libm.so`) hardcode `/lib/...` paths that --sysroot
re-roots into the toolchain.
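
On a stock Ubuntu host the chain is visible with (gcc 13 assumed,
output abbreviated):

    $ readlink -f /usr/bin/g++
    /usr/bin/x86_64-linux-gnu-g++-13
    $ dpkg -S /usr/bin/x86_64-linux-gnu-g++-13
    g++-13-x86-64-linux-gnu: /usr/bin/x86_64-linux-gnu-g++-13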

The unversioned `g++`/`gcc`/`cpp` symlinks are replaced with wrapper
shell scripts that resolve their own location at runtime and pass
`--sysroot=<toolchain>` and `-B <toolchain>/usr/lib/gcc/<triplet>/<ver>/`
to the underlying versioned binary. That's how torch's bare `g++ foo.cpp
-o foo` invocation finds cc1plus (-B), system headers (--sysroot), and
the bundled libstdc++ (--sysroot also applies to the linker's search).

`run.sh` adds the toolchain bin dir to PATH and the toolchain's
shared-lib dir to LD_LIBRARY_PATH -- everything else (header search,
linker search, executable search) is encapsulated in the wrappers.
This is a no-op for non-CPU builds, where the dir doesn't exist.
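
A quick manual probe of the relocated bundle (paths illustrative)
mirrors inductor's bare invocation:

    export PATH="${EDIR}/toolchain/usr/bin:${PATH}"
    echo 'int main(){return 0;}' > /tmp/probe.cpp
    g++ /tmp/probe.cpp -o /tmp/probe && /tmp/probe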

The cpu-vllm image grows by ~217 MB. Tradeoff is acceptable -- cpu-vllm
is already a niche profile (few users compared to GPU vllm) and the
alternative is a backend that crashes at first inference unless the
operator manually sets TORCH_COMPILE_DISABLE=1, which silently disables
all torch.compile optimizations.

Drops `TORCH_COMPILE_DISABLE=1` from tests/e2e/vllm-multinode -- the
smoke now exercises the real compile path through the bundled toolchain.
Test runtime is +20s for the warmup compile, still <90s end to end.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(vllm): scope jetson-ai-lab index to L4T-specific wheels via pyproject.toml

The L4T arm64 build resolves dependencies through pypi.jetson-ai-lab.io,
which hosts the L4T-specific torch / vllm / flash-attn wheels but also
transparently proxies the rest of PyPI through `/+f/<sha>/<filename>`
URLs. With `--extra-index-url` + `--index-strategy=unsafe-best-match`
uv would pick those proxy URLs for ordinary PyPI packages —
anthropic/openai/propcache/annotated-types — and fail when the proxy
503s. Master is hitting the same bug on its own l4t-vllm matrix entry.

Switch the l4t13 install path to a pyproject.toml that marks the
jetson-ai-lab index `explicit = true` and pins only torch, torchvision,
torchaudio, flash-attn, and vllm to it via [tool.uv.sources]. uv won't
consult the L4T mirror for anything else, so transitive deps fall back
to PyPI as the default index — no exposure to the proxy 503s.

`uv pip install -r requirements.txt` ignores [tool.uv.sources], so the
l4t13 branch in install.sh now invokes `uv pip install --requirement
pyproject.toml` directly, replacing the old requirements-l4t13*.txt
files. Other BUILD_PROFILEs continue using libbackend.sh's
installRequirements and never read pyproject.toml.

Local resolution test (x86_64, dry-run) confirms uv hits the L4T
index for torch and falls through to PyPI for everything else.
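
Roughly (uv resolves without installing under --dry-run; output
trimmed):

    cd backend/python/vllm
    uv pip install --dry-run --requirement pyproject.toml
    # torch/torchvision/torchaudio/flash-attn/vllm resolve from
    # pypi.jetson-ai-lab.io; everything else from pypi.org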

Assisted-by: claude-code:claude-opus-4-7-1m [Read] [Edit] [Bash] [Write]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>

backend/python/vllm/install.sh

@@ -18,12 +18,15 @@ else
source $backend_dir/../common/libbackend.sh
fi

# Intel XPU: torch==2.11.0+xpu lives on the PyTorch XPU index, transitive
# deps on PyPI — unsafe-best-match lets uv mix both. vllm-xpu-kernels only
# ships a python3.12 wheel per upstream docs, so bump the portable Python
# before installRequirements (matches the l4t13 pattern below).
# https://github.com/vllm-project/vllm/blob/main/docs/getting_started/installation/gpu.xpu.inc.md
if [ "x${BUILD_PROFILE}" == "xintel" ]; then
    PYTHON_VERSION="3.12"
    PYTHON_PATCH="11"
    EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi
# CPU builds need unsafe-best-match to pull torch==2.10.0+cpu from the
@@ -42,10 +45,12 @@ fi
# JetPack 7 / L4T arm64 wheels (torch, vllm, flash-attn) live on
# pypi.jetson-ai-lab.io and are built for cp312, so bump the venv Python
# accordingly. JetPack 6 keeps cp310 + USE_PIP=true.
#
# l4t13 uses pyproject.toml (see the elif branch below) to pin only the
# L4T-specific wheels to the jetson-ai-lab index via [tool.uv.sources].
# That keeps PyPI as the resolution path for transitive deps like
# anthropic/openai/propcache, which the L4T mirror's proxy 503s on.
if [ "x${BUILD_PROFILE}" == "xl4t12" ]; then
    USE_PIP=true
fi
@@ -53,16 +58,77 @@ if [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
PYTHON_VERSION="3.12"
PYTHON_PATCH="12"
PY_STANDALONE_TAG="20251120"
EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi
# Intel XPU has no upstream-published vllm wheels, so we always build vllm
# from source against torch-xpu and replace the default triton with
# triton-xpu (matching torch 2.11). Mirrors the upstream procedure:
# https://github.com/vllm-project/vllm/blob/main/docs/getting_started/installation/gpu.xpu.inc.md
if [ "x${BUILD_TYPE}" == "xintel" ]; then
# Hide requirements-intel-after.txt so installRequirements doesn't
# try `pip install vllm` (would either fail or grab a non-XPU wheel).
_intel_after="${backend_dir}/requirements-intel-after.txt"
_intel_after_bak=""
if [ -f "${_intel_after}" ]; then
_intel_after_bak="${_intel_after}.xpu.bak"
mv "${_intel_after}" "${_intel_after_bak}"
fi
installRequirements
if [ -n "${_intel_after_bak}" ]; then
mv "${_intel_after_bak}" "${_intel_after}"
fi
# vllm's CMake build needs the Intel oneAPI dpcpp/sycl compiler — the
# base image (intel/oneapi-basekit) has it but the env isn't sourced.
if [ -f /opt/intel/oneapi/setvars.sh ]; then
set +u
source /opt/intel/oneapi/setvars.sh --force
set -u
fi
_vllm_src=$(mktemp -d)
trap 'rm -rf "${_vllm_src}"' EXIT
git clone --depth 1 https://github.com/vllm-project/vllm "${_vllm_src}/vllm"
pushd "${_vllm_src}/vllm"
# Install vllm's own runtime deps (torch-xpu, vllm_xpu_kernels,
# pydantic, fastapi, …) from upstream's requirements/xpu.txt — the
# canonical source of truth. Avoids re-pinning everything ourselves.
uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} -r requirements/xpu.txt
# Stock triton (NVIDIA-only) may have come in transitively; replace
# with triton-xpu==3.7.0 which matches torch 2.11.
uv pip uninstall triton triton-xpu 2>/dev/null || true
uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} \
--extra-index-url https://download.pytorch.org/whl/xpu \
triton-xpu==3.7.0
export CMAKE_PREFIX_PATH="$(python -c 'import site; print(site.getsitepackages()[0])'):${CMAKE_PREFIX_PATH:-}"
VLLM_TARGET_DEVICE=xpu uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} --no-deps .
popd
# L4T arm64 (JetPack 7): drive the install through pyproject.toml so that
# [tool.uv.sources] can pin torch/vllm/flash-attn/torchvision/torchaudio
# to the jetson-ai-lab index, while everything else (transitive deps and
# PyPI-resolvable packages like transformers) comes from PyPI. Bypasses
# installRequirements because uv pip install -r requirements.txt does not
# honor sources — see backend/python/vllm/pyproject.toml for the rationale.
elif [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
ensureVenv
if [ "x${PORTABLE_PYTHON}" == "xtrue" ]; then
export C_INCLUDE_PATH="${C_INCLUDE_PATH:-}:$(_portable_dir)/include/python${PYTHON_VERSION}"
fi
pushd "${backend_dir}"
# Build deps first (matches installRequirements' requirements-install.txt
# pass — fastsafetensors and friends need pybind11 in the venv before
# their sdists can build under --no-build-isolation).
uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} -r requirements-install.txt
uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} --requirement pyproject.toml
popd
runProtogen
# FROM_SOURCE=true on a CPU build skips the prebuilt vllm wheel in
# requirements-cpu-after.txt and compiles vllm locally against the host's
# actual CPU. Not used by default because it takes ~30-40 minutes, but
# kept here for hosts where the prebuilt wheel SIGILLs (CPU without the
# required SIMD baseline, e.g. AVX-512 VNNI/BF16). Default CI uses a
# bigger-runner with compatible hardware instead.
if [ "x${BUILD_TYPE}" == "x" ] && [ "x${FROM_SOURCE:-}" == "xtrue" ]; then
elif [ "x${BUILD_TYPE}" == "x" ] && [ "x${FROM_SOURCE:-}" == "xtrue" ]; then
# Temporarily hide the prebuilt wheel so installRequirements doesn't
# pull it — the rest of the requirements files (base deps, torch,
# transformers) are still installed normally.

backend/python/vllm/package.sh

@@ -45,5 +45,109 @@ copy_with_symlinks() {
copy_with_symlinks libnuma.so.1
copy_with_symlinks libgomp.so.1

# CPU profile only: bundle a g++ toolchain so torch._inductor's
# ISA probe (always run at vllm engine startup, regardless of
# enforce_eager) finds a C++ compiler. The LocalAI runtime image
# is FROM ubuntu:24.04 with a minimal apt list that does not
# include build-essential, and the backend image itself is FROM
# scratch -- so without this, cpu-vllm crashes with
# torch._inductor.exc.InvalidCxxCompiler at first inference
# unless the operator manually sets TORCH_COMPILE_DISABLE=1.
#
# We snapshot every file owned by the toolchain packages, mirroring
# the /usr/... layout into ${BACKEND}/toolchain/, and swap the
# unversioned g++/gcc/cpp entry points for wrapper scripts that pass
# --sysroot/-B, so the compiler finds cc1plus, headers and libs in
# the bundle (run.sh only has to put toolchain/usr/bin on PATH). Adds
# ~400 MB to the cpu-vllm image, which is tolerable -- cpu-vllm is
# already a niche profile.
if [ "${BUILD_TYPE:-}" = "" ] && command -v dpkg-query >/dev/null 2>&1; then
    TOOLCHAIN_DIR="${CURDIR}/toolchain"
    mkdir -p "${TOOLCHAIN_DIR}"
    # The unversioned g++/gcc packages on Debian/Ubuntu only ship
    # symlinks; the actual binaries live in g++-${VER}/gcc-${VER}.
    # Discover the active version so the symlink targets get bundled
    # along with their owners.
    GCC_VER=$(gcc -dumpversion 2>/dev/null | cut -d. -f1 || true)
    # `g++-${VER}` itself is just another symlink layer on Debian/
    # Ubuntu — the real binary `x86_64-linux-gnu-g++-${VER}` lives
    # in `g++-${VER}-x86-64-linux-gnu` (a separate package pulled in
    # as a dependency). Same story for gcc/cpp. Compute the dpkg
    # arch-triplet to find the right package name for both amd64 and
    # arm64 hosts.
    case "$(dpkg --print-architecture 2>/dev/null)" in
        amd64) HOST_TRIPLET="x86-64-linux-gnu" ;;
        arm64) HOST_TRIPLET="aarch64-linux-gnu" ;;
        *) HOST_TRIPLET="" ;;
    esac
    PKGS=(g++ gcc cpp libstdc++-${GCC_VER}-dev libgcc-${GCC_VER}-dev libc6 libc6-dev binutils binutils-common libbinutils libc-dev-bin linux-libc-dev libcrypt-dev libgomp1 libstdc++6 libgcc-s1 libisl23 libmpc3 libmpfr6 libjansson4 libctf0 libctf-nobfd0 libsframe1)
    if [ -n "${GCC_VER}" ]; then
        PKGS+=("g++-${GCC_VER}" "gcc-${GCC_VER}" "cpp-${GCC_VER}" "gcc-${GCC_VER}-base")
        if [ -n "${HOST_TRIPLET}" ]; then
            PKGS+=(
                "g++-${GCC_VER}-${HOST_TRIPLET}"
                "gcc-${GCC_VER}-${HOST_TRIPLET}"
                "cpp-${GCC_VER}-${HOST_TRIPLET}"
                "binutils-${HOST_TRIPLET}"
            )
        fi
    fi
    for pkg in "${PKGS[@]}"; do
        if ! dpkg-query -W "${pkg}" >/dev/null 2>&1; then
            continue
        fi
        # Copy each owned path, preserving symlinks and mode. We
        # tolerate dpkg listing directories alongside files.
        dpkg -L "${pkg}" | while IFS= read -r path; do
            if [ -L "${path}" ] || [ -f "${path}" ]; then
                mkdir -p "${TOOLCHAIN_DIR}$(dirname "${path}")"
                cp -aP "${path}" "${TOOLCHAIN_DIR}${path}" 2>/dev/null || true
            fi
        done
    done
    # Ubuntu's filesystem layout has /lib -> /usr/lib (UsrMerge) and
    # /lib64 -> /usr/lib64. ld scripts (e.g. libm.so) hardcode
    # `/lib/x86_64-linux-gnu/libm.so.6`; with --sysroot the linker
    # looks for that path under the sysroot, which means we need
    # the same symlinks under TOOLCHAIN_DIR.
    [ -e "${TOOLCHAIN_DIR}/lib" ] || ln -s usr/lib "${TOOLCHAIN_DIR}/lib"
    [ -e "${TOOLCHAIN_DIR}/lib64" ] || ln -s usr/lib64 "${TOOLCHAIN_DIR}/lib64"
    # Replace the unversioned g++/gcc/cpp symlinks with wrapper
    # scripts that pass --sysroot=<toolchain> and -B <gcc-exec-prefix>.
    # Without these flags gcc would fall back to its compiled-in
    # /usr search and fail to find headers (the runtime image has no
    # libc6-dev) or fail to invoke `as`/`ld` (binutils not on PATH at
    # /usr/bin). Wrappers self-resolve their location at runtime so
    # they work from any BackendsPath.
    BIN_DIR="${TOOLCHAIN_DIR}/usr/bin"
    if [ -n "${GCC_VER}" ] && [ -n "${HOST_TRIPLET}" ]; then
        # HOST_TRIPLET in package names uses dashes ("x86-64-linux-gnu");
        # the binary suffix uses underscores in the arch part
        # ("x86_64-linux-gnu-g++-13"). Translate.
        BIN_TRIPLET=${HOST_TRIPLET//x86-64/x86_64}
        for tool in g++ gcc cpp; do
            real="${BIN_DIR}/${BIN_TRIPLET}-${tool}-${GCC_VER}"
            if [ -x "${real}" ]; then
                rm -f "${BIN_DIR}/${tool}" "${BIN_DIR}/${tool}-${GCC_VER}"
                cat > "${BIN_DIR}/${tool}" <<EOF
#!/bin/bash
# Auto-generated by package.sh. Passes --sysroot and -B so the
# bundled toolchain works from any BackendsPath without depending
# on libc6-dev / binutils being installed at /usr in the runtime
# image. See backend/python/vllm/package.sh.
DIR="\$(dirname "\$(readlink -f "\$0")")"      # …/toolchain/usr/bin
SYSROOT="\$(dirname "\$(dirname "\${DIR}")")"  # …/toolchain
exec "\${DIR}/${BIN_TRIPLET}-${tool}-${GCC_VER}" \\
    -B "\${SYSROOT}/usr/lib/gcc/${BIN_TRIPLET}/${GCC_VER}/" \\
    --sysroot="\${SYSROOT}" \\
    "\$@"
EOF
                chmod +x "${BIN_DIR}/${tool}"
            fi
        done
    fi
    echo "Bundled g++ toolchain (gcc-${GCC_VER}) into ${TOOLCHAIN_DIR} ($(du -sh "${TOOLCHAIN_DIR}" | cut -f1))"
fi
echo "vllm packaging completed successfully"
ls -liah "${LIB_DIR}/"

backend/python/vllm/pyproject.toml (new file)

@@ -0,0 +1,61 @@
# L4T arm64 (JetPack 7 / sbsa cu130) install spec for the vllm backend.
#
# Why this file exists, and why only the l4t13 BUILD_PROFILE consumes it:
#
# pypi.jetson-ai-lab.io hosts the L4T-specific torch / vllm / flash-attn
# wheels we need on aarch64 + cuda13, but it ALSO transparently proxies the
# rest of PyPI through `/+f/<sha>/<filename>` URLs that 503 frequently. With
# `--extra-index-url` + `--index-strategy=unsafe-best-match` (the historical
# fix in install.sh) uv would pick those proxy URLs for ordinary PyPI
# packages — `anthropic`, `openai`, `propcache`, `annotated-types` — and
# trip on the 503s. See e.g. CI run 25212201349 (anthropic-0.97.0).
#
# `explicit = true` on the index makes uv consult the L4T mirror ONLY for
# packages mapped under [tool.uv.sources]. Everything else goes to PyPI.
# This breaks the historical 503 path without losing access to the L4T
# wheels we actually need from there.
#
# `uv pip install -r requirements.txt` does NOT honor [tool.uv.sources]
# (sources are project-mode only, not pip-compat mode), so install.sh's
# l4t13 branch invokes `uv pip install --requirement pyproject.toml`
# directly. Other BUILD_PROFILEs continue to use the requirements-*.txt
# pipeline through libbackend.sh's installRequirements and never read
# this file.
[project]
name = "localai-vllm-l4t13"
version = "0.0.0"
requires-python = ">=3.12,<3.13"
dependencies = [
    # Mirror of requirements.txt — kept in sync manually for now since the
    # l4t13 path bypasses installRequirements (see install.sh).
    "grpcio==1.80.0",
    "protobuf",
    "certifi",
    "setuptools",
    "pillow",
    "charset-normalizer>=3.4.0",
    "chardet",
    # L4T-specific accelerator stack (sourced from jetson-ai-lab below).
    "torch",
    "torchvision",
    "torchaudio",
    "flash-attn",
    "vllm",
    # PyPI-resolvable packages that complete the runtime — accelerate,
    # transformers, bitsandbytes carry their own wheels for aarch64.
    "accelerate",
    "transformers",
    "bitsandbytes",
]

[[tool.uv.index]]
name = "jetson-ai-lab"
url = "https://pypi.jetson-ai-lab.io/sbsa/cu130"
explicit = true

[tool.uv.sources]
torch = { index = "jetson-ai-lab" }
torchvision = { index = "jetson-ai-lab" }
torchaudio = { index = "jetson-ai-lab" }
flash-attn = { index = "jetson-ai-lab" }
vllm = { index = "jetson-ai-lab" }

backend/python/vllm/requirements-intel-after.txt

@@ -1 +1,3 @@
# Intel XPU has no upstream-published vllm wheels — install.sh builds vllm
# from source with VLLM_TARGET_DEVICE=xpu and hides this file during
# installRequirements. Don't add a `vllm` line here.

backend/python/vllm/requirements-intel.txt

@@ -1,7 +1,8 @@
--extra-index-url https://download.pytorch.org/whl/xpu
# vllm's own deps (torch==2.11.0+xpu, vllm_xpu_kernels, pydantic, …) are
# installed from upstream's requirements/xpu.txt during the source build —
# see install.sh. Only list what LocalAI's vllm backend.py needs directly.
accelerate
transformers
setuptools
bitsandbytes

backend/python/vllm/requirements-l4t13-after.txt (deleted)

@@ -1,2 +0,0 @@
--extra-index-url https://pypi.jetson-ai-lab.io/sbsa/cu130
vllm

backend/python/vllm/requirements-l4t13.txt (deleted)

@@ -1,8 +0,0 @@
--extra-index-url https://pypi.jetson-ai-lab.io/sbsa/cu130
accelerate
torch
torchvision
torchaudio
transformers
bitsandbytes
flash-attn

backend/python/vllm/requirements.txt

@@ -1,4 +1,7 @@
grpcio==1.80.0
protobuf
certifi
setuptools
pillow
charset-normalizer>=3.4.0
chardet

backend/python/vllm/run.sh

@@ -1,4 +1,5 @@
#!/bin/bash
set -x
backend_dir=$(dirname $0)
@@ -8,4 +9,41 @@ else
source $backend_dir/../common/libbackend.sh
fi

# CPU profile: torch._inductor's ISA-probe (run at vllm engine
# startup, even with enforce_eager=True) shells out to g++. The
# LocalAI runtime image and the FROM-scratch backend image both
# omit a compiler; package.sh bundles one into ${EDIR}/toolchain
# along with wrapper scripts at toolchain/usr/bin that already pass
# --sysroot and -B. So all run.sh has to do is put the wrapper on
# PATH and expose the toolchain's shared libs (libisl, libmpc, libbfd,
# ...) to ld.so. No-op for other profiles -- the dir doesn't exist.
if [ -d "${EDIR}/toolchain/usr/bin" ]; then
    export PATH="${EDIR}/toolchain/usr/bin:${PATH}"
    _libpath="${EDIR}/toolchain/usr/lib/x86_64-linux-gnu"
    export LD_LIBRARY_PATH="${_libpath}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
fi

# Multi-node DP follower mode: when the first arg is `serve`, exec into
# vllm's own CLI instead of LocalAI's backend.py gRPC server. The
# follower speaks ZMQ directly to the head node's vllm ranks — there
# is no LocalAI gRPC on the follower side. Reaches this path via
# `local-ai p2p-worker vllm`.
if [ "${1:-}" = "serve" ]; then
    ensureVenv
    if [ "x${PORTABLE_PYTHON}" == "xtrue" ] || [ -x "$(_portable_python)" ]; then
        _makeVenvPortable --update-pyvenv-cfg
    fi
    if [ -d "${EDIR}/lib" ]; then
        export LD_LIBRARY_PATH="${EDIR}/lib:${LD_LIBRARY_PATH:-}"
    fi
    # Run the vllm console script through the venv python rather than
    # exec-ing it directly. uv bakes an absolute shebang at install time
    # (e.g. `#!/vllm/venv/bin/python3` from the build image) which
    # doesn't exist once the backend is relocated to BackendsPath, and
    # _makeVenvPortable's shebang rewriter only matches paths that
    # already point at ${EDIR}. Invoking python with the script as an
    # argument bypasses the shebang entirely.
    exec "${EDIR}/venv/bin/python" "${EDIR}/venv/bin/vllm" "$@"
fi

startBackend $@