mirror of https://github.com/mudler/LocalAI.git (synced 2026-05-16 20:52:08 -04:00)
feat(vllm, distributed): tensor parallel distributed workers (#9612)
* feat(vllm): build vllm from source for Intel XPU
Upstream publishes no XPU wheels for vllm. The Intel profile was
silently picking up a non-XPU wheel that imported but errored at
engine init, and several runtime deps (pillow, charset-normalizer,
chardet) were missing on Intel -- backend.py crashed at import time
before the gRPC server came up.
Switch the Intel profile to upstream's documented from-source
procedure (docs/getting_started/installation/gpu.xpu.inc.md in
vllm-project/vllm):
- Bump portable Python to 3.12 -- vllm-xpu-kernels ships only a
cp312 wheel.
- Source /opt/intel/oneapi/setvars.sh so vllm's CMake build sees
the dpcpp/sycl compiler from the oneapi-basekit base image.
- Hide requirements-intel-after.txt during installRequirements
(it used to 'pip install vllm'); install vllm's deps from a
fresh git clone of vllm via 'uv pip install -r
requirements/xpu.txt', swap stock triton for
triton-xpu==3.7.0, then 'VLLM_TARGET_DEVICE=xpu uv pip install
--no-deps .'.
- requirements-intel.txt trimmed to LocalAI's direct deps
(accelerate / transformers / bitsandbytes); torch-xpu, vllm,
vllm_xpu_kernels and the rest come from upstream's xpu.txt
during the source build.
- requirements.txt: add pillow + charset-normalizer + chardet --
used by backend.py and missing on the Intel install profile.
- run.sh: 'set -x' so backend startup is visible in container
logs (the gRPC startup error path was previously opaque).
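Condensed, the Intel branch of install.sh amounts to the following
(a sketch; the committed script adds the requirements-intel-after.txt
hide/restore dance and error handling around each step):

```bash
# Sketch of the XPU from-source build -- see install.sh for the real thing.
source /opt/intel/oneapi/setvars.sh --force        # dpcpp/sycl for vllm's CMake
git clone --depth 1 https://github.com/vllm-project/vllm && cd vllm
uv pip install -r requirements/xpu.txt             # torch-xpu, vllm_xpu_kernels, ...
uv pip uninstall triton triton-xpu || true         # drop the NVIDIA-only triton
uv pip install --extra-index-url https://download.pytorch.org/whl/xpu triton-xpu==3.7.0
VLLM_TARGET_DEVICE=xpu uv pip install --no-deps .  # build vllm itself
```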
Also adds a one-line docs example for engine_args.attention_backend
under the vLLM section, since older XE-HPG GPUs (e.g. Arc A770)
need TRITON_ATTN to bypass the cutlass path in vllm_xpu_kernels.
Tested end-to-end on an Intel Arc A770 with Qwen2.5-0.5B-Instruct
via LocalAI's /v1/chat/completions.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(vllm): add multi-node data-parallel follower worker
vLLM v1's multi-node story is one process per node sharing a DP
coordinator over ZMQ -- the head runs the API server with
data_parallel_size > 1 and followers run `vllm serve --headless ...`
with matching topology. Today LocalAI can already configure DP on the
head via the engine_args YAML map, but there's no way to bring up the
follower nodes -- so the head sits waiting for ranks that never
handshake.
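For reference, the vanilla two-node vLLM invocation this maps onto looks
roughly like the following (flag names per vLLM v1's serve CLI, as also
exercised by buildVLLMArgs below; addresses illustrative):

```bash
# Head (rank 0): serves the API and one local DP rank.
vllm serve Qwen/Qwen2.5-0.5B-Instruct \
  --data-parallel-size 2 --data-parallel-size-local 1 \
  --data-parallel-address 10.0.0.1 --data-parallel-rpc-port 32100

# Follower: no API server; handshakes with the head over ZMQ.
vllm serve Qwen/Qwen2.5-0.5B-Instruct --headless \
  --data-parallel-size 2 --data-parallel-size-local 1 \
  --data-parallel-start-rank 1 \
  --data-parallel-address 10.0.0.1 --data-parallel-rpc-port 32100
```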
Add `local-ai p2p-worker vllm`, mirroring MLXDistributed's structural
precedent (operator-launched, static config, no NATS placement). The
worker:
- Optionally self-registers with the frontend as an agent-type node
tagged `node.role=vllm-follower` so it's visible in the admin UI
and operators can scope ordinary models away via inverse
selectors.
- Resolves the platform-specific vllm backend via the gallery's
"vllm" meta-entry (cuda*, intel-vllm, rocm-vllm, ...).
- Runs vLLM as a child process so the heartbeat goroutine survives
until vLLM exits; forwards SIGINT/SIGTERM so vLLM can clean up its
ZMQ sockets before we tear down.
- Rejects --headless combined with --start-rank 0 (rank 0 is the
  head and must serve the API).
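A representative follower invocation (values illustrative; matches the
docs added in this change):

```bash
local-ai p2p-worker vllm Qwen/Qwen2.5-0.5B-Instruct \
  --data-parallel-size 2 --data-parallel-size-local 1 --start-rank 1 \
  --master-addr 10.0.0.1 --master-port 32100 \
  --register-to http://head:8080 --registration-token changeme
```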
Backend run.sh dispatches `serve` as the first arg to vllm's own CLI
instead of LocalAI's backend.py gRPC server -- the follower speaks
ZMQ directly to the head, there is no LocalAI gRPC on the follower
side. Single-node usage is unchanged.
Generalises the gallery resolution helper into findBackendPath()
shared by MLX and vLLM workers; extracts ParseNodeLabels for the
comma-separated label parsing both use.
Ships with two compose recipes (`docker-compose.vllm-multinode.yaml`
for NVIDIA, `docker-compose.vllm-multinode.intel.yaml` for Intel
XPU/xccl) plus `tests/e2e/vllm-multinode/smoke.sh`. Both vendors are
supported (NCCL for CUDA/ROCm, xccl for XPU) but mixed-vendor DP is
not -- PyTorch's process group requires every rank to use the same
collective backend, and NCCL/xccl/gloo don't interoperate.
Out of scope (deferred): SmartRouter-driven placement of follower
ranks via NATS backend.install events, follower log streaming through
/api/backend-logs, tensor-parallel across nodes, disaggregated
prefill via KVTransferConfig.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* test(vllm): CPU-only end-to-end test for multi-node DP
Adds tests/e2e/vllm-multinode/, a Ginkgo + testcontainers-go suite
that brings up a head + headless follower from the locally-built
local-ai:tests image, bind-mounts the cpu-vllm backend extracted by
make extract-backend-vllm so it's seen as a system backend (no gallery
fetch, no registry server), and asserts a chat completion across both
DP ranks. New `make test-e2e-vllm-multinode` target wires the docker
build, backend extract, and ginkgo run together; BuildKit caches both
images so re-runs only rebuild what changed. Tagged Label("VLLMMultinode")
so the existing distributed suite isn't pulled along.
Two pre-existing bugs surfaced by the test:
1. extract-backend-% (Makefile) failed for every backend, because all
backend images end with `FROM scratch` and `docker create` rejects
an image with no CMD/ENTRYPOINT. Fixed by passing
--entrypoint=/run.sh -- the container is never started, only
docker-cp'd, so the path doesn't have to exist; we just need
anything that satisfies the daemon's create-time validation.
2. backend/python/vllm/run.sh's `serve` shortcut for the multi-node DP
follower exec'd ${EDIR}/venv/bin/vllm directly, but uv bakes an
absolute build-time shebang (`#!/vllm/venv/bin/python3`) that no
longer resolves once the backend is relocated to BackendsPath.
_makeVenvPortable's shebang rewriter only matches paths that
already point at ${EDIR}, so the original shebang slips through
unchanged. Fixed by exec-ing ${EDIR}/venv/bin/python with the script
as an argument -- Python ignores the script's shebang in that case.
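The fix as it lands in run.sh boils down to one line:

```bash
# Python ignores the target script's shebang when invoked explicitly,
# so the stale build-time /vllm/venv/bin/python3 path is never consulted.
exec "${EDIR}/venv/bin/python" "${EDIR}/venv/bin/vllm" "$@"
```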
The test fixture caps memory aggressively (max_model_len=512,
VLLM_CPU_KVCACHE_SPACE=1, TORCH_COMPILE_DISABLE=1) so two CPU engines
fit on a 32 GB box. TORCH_COMPILE_DISABLE is currently mandatory for
cpu-vllm: torch._inductor's CPU-ISA probe runs even with
enforce_eager=True and needs g++ on PATH, which the LocalAI runtime
image doesn't ship -- to be addressed in a follow-up that bundles a
toolchain in the cpu-vllm backend.
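The caps, expressed as the environment a container or shell session
would carry (illustrative shape; the suite wires these up through
testcontainers-go, and max_model_len rides in the model's engine_args):

```bash
export VLLM_CPU_KVCACHE_SPACE=1   # GiB of KV cache per CPU engine
export TORCH_COMPILE_DISABLE=1    # mandatory until the toolchain follow-up
```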
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(vllm): bundle a g++ toolchain in the cpu-vllm backend image
torch._inductor's CPU-ISA probe (`cpu_model_runner.py:65 "Warming up
model for the compilation"`) shells out to `g++` at vllm engine
startup, regardless of `enforce_eager=True` -- the eager flag only
disables CUDA graphs, not inductor's first-batch warmup. The LocalAI
CPU runtime image (Dockerfile, unconditional apt list) does not ship
build-essential, and the cpu-vllm backend image is `FROM scratch`,
so any non-trivial inference on cpu-vllm crashes with:
torch._inductor.exc.InductorError:
InvalidCxxCompiler: No working C++ compiler found in
torch._inductor.config.cpp.cxx: (None, 'g++')
Bundling the toolchain in the CPU runtime image would bloat every
non-vllm-CPU deployment and force a single GCC version on backends
that may want clang or a different version. So this lives in the
backend, gated to BUILD_TYPE=='' (the CPU profile).
`package.sh` snapshots g++ + binutils + cc1plus + libstdc++ + libc6
(runtime + dev) + the math libs cc1plus links (libisl/libmpc/libmpfr/
libjansson) into ${BACKEND}/toolchain/, mirroring /usr/... layout. The
unversioned binaries on Debian/Ubuntu are symlink chains pointing into
multiarch packages (`g++` -> `g++-13` -> `x86_64-linux-gnu-g++-13`,
the latter in `g++-13-x86-64-linux-gnu`), so the package list resolves
both the version and the arch-triplet variant. Symlinks /lib ->
usr/lib and /lib64 -> usr/lib64 are recreated under the toolchain
root because Ubuntu's UsrMerge keeps them at /, and ld scripts
(`libc.so`, `libm.so`) hardcode `/lib/...` paths that --sysroot
re-roots into the toolchain.
The unversioned `g++`/`gcc`/`cpp` symlinks are replaced with wrapper
shell scripts that resolve their own location at runtime and pass
`--sysroot=<toolchain>` and `-B <toolchain>/usr/lib/gcc/<triplet>/<ver>/`
to the underlying versioned binary. That's how torch's bare `g++ foo.cpp
-o foo` invocation finds cc1plus (-B), system headers (--sysroot), and
the bundled libstdc++ (--sysroot again; the driver also forwards the
sysroot into the linker's library search).
`run.sh` adds the toolchain bin dir to PATH and the toolchain's
shared-lib dir to LD_LIBRARY_PATH -- everything else (header search,
linker search, executable search) is encapsulated in the wrappers.
No-op for non-CPU builds, the dir doesn't exist there.
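A quick sanity check of the bundled toolchain from inside a running
cpu-vllm container might look like this (the EDIR value is a
hypothetical default install location, not fixed by this commit):

```bash
# The wrapper resolves its own location at runtime, so it works from
# wherever the backend was relocated; -B/--sysroot are baked in.
EDIR=/backends/cpu-vllm   # assumption: adjust to your BackendsPath
printf 'int main(){return 0;}\n' > /tmp/probe.cpp
"${EDIR}/toolchain/usr/bin/g++" /tmp/probe.cpp -o /tmp/probe && /tmp/probe && echo OK
```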
The cpu-vllm image grows by ~217 MB. Tradeoff is acceptable -- cpu-vllm
is already a niche profile (few users compared to GPU vllm) and the
alternative is a backend that crashes at first inference unless the
operator manually sets TORCH_COMPILE_DISABLE=1, which silently disables
all torch.compile optimizations.
Drops `TORCH_COMPILE_DISABLE=1` from tests/e2e/vllm-multinode -- the
smoke now exercises the real compile path through the bundled toolchain.
Test runtime is +20s for the warmup compile, still <90s end to end.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* fix(vllm): scope jetson-ai-lab index to L4T-specific wheels via pyproject.toml
The L4T arm64 build resolves dependencies through pypi.jetson-ai-lab.io,
which hosts the L4T-specific torch / vllm / flash-attn wheels but also
transparently proxies the rest of PyPI through `/+f/<sha>/<filename>`
URLs. With `--extra-index-url` + `--index-strategy=unsafe-best-match`
uv would pick those proxy URLs for ordinary PyPI packages —
anthropic/openai/propcache/annotated-types — and fail when the proxy
503s. Master is hitting the same bug on its own l4t-vllm matrix entry.
Switch the l4t13 install path to a pyproject.toml that marks the
jetson-ai-lab index `explicit = true` and pins only torch, torchvision,
torchaudio, flash-attn, and vllm to it via [tool.uv.sources]. uv won't
consult the L4T mirror for anything else, so transitive deps fall back
to PyPI as the default index — no exposure to the proxy 503s.
`uv pip install -r requirements.txt` ignores [tool.uv.sources], so the
l4t13 branch in install.sh now invokes `uv pip install --requirement
pyproject.toml` directly, replacing the old requirements-l4t13*.txt
files. Other BUILD_PROFILEs continue using libbackend.sh's
installRequirements and never read pyproject.toml.
Local resolution test (x86_64, dry-run) confirms uv hits the L4T
index for torch and falls through to PyPI for everything else.
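The check was along these lines (uv supports --dry-run on pip install;
the exact invocation isn't recorded in the commit):

```bash
cd backend/python/vllm
uv pip install --dry-run --requirement pyproject.toml
# Expected: torch/torchvision/torchaudio/flash-attn/vllm resolve via
# https://pypi.jetson-ai-lab.io/sbsa/cu130, everything else via PyPI.
```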
Assisted-by: claude-code:claude-opus-4-7-1m [Read] [Edit] [Bash] [Write]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
committed by GitHub · parent 503904d311 · commit 8e43842175
Makefile (24 changed lines)
@@ -232,6 +232,20 @@ run-e2e-aio: protogen-go
	@echo 'Running e2e AIO tests'
	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e-aio

# vLLM multi-node DP smoke (CPU). Builds local-ai:tests and the
# cpu-vllm backend from the current working tree, then drives a
# head + headless follower via testcontainers-go and asserts a chat
# completion. BuildKit caches both images, so re-runs only rebuild
# what changed. The test lives under tests/e2e/distributed and is
# selected by the VLLMMultinode label so it doesn't run alongside
# the other distributed-suite tests by default.
test-e2e-vllm-multinode: docker-build-e2e extract-backend-vllm protogen-go
	@echo 'Running e2e vLLM multi-node DP test'
	LOCALAI_IMAGE=local-ai \
	LOCALAI_IMAGE_TAG=tests \
	LOCALAI_VLLM_BACKEND_DIR=$(abspath ./local-backends/vllm) \
	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter='VLLMMultinode' -v -r ./tests/e2e/distributed

########################################################
## E2E tests
########################################################

@@ -319,7 +333,7 @@ local-backends:

extract-backend-%: docker-build-% local-backends
	@echo "Extracting backend $*..."
	@CID=$$(docker create local-ai-backend:$*) && \
	@CID=$$(docker create --entrypoint=/run.sh local-ai-backend:$*) && \
	rm -rf local-backends/$* && mkdir -p local-backends/$* && \
	docker cp $$CID:/ - | tar -xf - -C local-backends/$* && \
	docker rm $$CID > /dev/null

@@ -594,6 +608,14 @@ test-extra-backend-vllm: docker-build-vllm
	BACKEND_TEST_OPTIONS=tool_parser:hermes \
	$(MAKE) test-extra-backend

## vllm multi-node data-parallel smoke test. Runs LocalAI head + a
## `local-ai p2p-worker vllm` follower in docker compose against
## Qwen2.5-0.5B with data_parallel_size=2. Requires 2 NVIDIA GPUs and
## nvidia-container-runtime on the host — vLLM v1's DP coordinator is
## not viable on CPU so this cannot run in CI without GPU.
test-extra-backend-vllm-multinode:
	./tests/e2e/vllm-multinode/smoke.sh

## tinygrad mirrors the vllm target (same model, same caps, same parser) so
## the two backends are directly comparable. The LLM path covers Predict,
## streaming and native tool-call extraction. Companion targets below cover
backend/python/vllm/install.sh

@@ -18,12 +18,15 @@ else
    source $backend_dir/../common/libbackend.sh
fi

# This is here because the Intel pip index is broken and returns 200 status codes for every package name, it just doesn't return any package links.
# This makes uv think that the package exists in the Intel pip index, and by default it stops looking at other pip indexes once it finds a match.
# We need uv to continue falling through to the pypi default index to find optimum[openvino] in the pypi index
# the --upgrade actually allows us to *downgrade* torch to the version provided in the Intel pip index
# Intel XPU: torch==2.11.0+xpu lives on the PyTorch XPU index, transitive
# deps on PyPI — unsafe-best-match lets uv mix both. vllm-xpu-kernels only
# ships a python3.12 wheel per upstream docs, so bump the portable Python
# before installRequirements (matches the l4t13 pattern below).
# https://github.com/vllm-project/vllm/blob/main/docs/getting_started/installation/gpu.xpu.inc.md
if [ "x${BUILD_PROFILE}" == "xintel" ]; then
    EXTRA_PIP_INSTALL_FLAGS+=" --upgrade --index-strategy=unsafe-first-match"
    PYTHON_VERSION="3.12"
    PYTHON_PATCH="11"
    EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi

# CPU builds need unsafe-best-match to pull torch==2.10.0+cpu from the

@@ -42,10 +45,12 @@ fi

# JetPack 7 / L4T arm64 wheels (torch, vllm, flash-attn) live on
# pypi.jetson-ai-lab.io and are built for cp312, so bump the venv Python
# accordingly. JetPack 6 keeps cp310 + USE_PIP=true. unsafe-best-match
# is required because the jetson-ai-lab index lists transitive deps at
# limited versions — without it uv pins to the first matching index and
# fails to resolve a compatible wheel from PyPI.
# accordingly. JetPack 6 keeps cp310 + USE_PIP=true.
#
# l4t13 uses pyproject.toml (see the elif branch below) to pin only the
# L4T-specific wheels to the jetson-ai-lab index via [tool.uv.sources].
# That keeps PyPI as the resolution path for transitive deps like
# anthropic/openai/propcache, which the L4T mirror's proxy 503s on.
if [ "x${BUILD_PROFILE}" == "xl4t12" ]; then
    USE_PIP=true
fi

@@ -53,16 +58,77 @@ if [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
    PYTHON_VERSION="3.12"
    PYTHON_PATCH="12"
    PY_STANDALONE_TAG="20251120"
    EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi

# Intel XPU has no upstream-published vllm wheels, so we always build vllm
# from source against torch-xpu and replace the default triton with
# triton-xpu (matching torch 2.11). Mirrors the upstream procedure:
# https://github.com/vllm-project/vllm/blob/main/docs/getting_started/installation/gpu.xpu.inc.md
if [ "x${BUILD_TYPE}" == "xintel" ]; then
    # Hide requirements-intel-after.txt so installRequirements doesn't
    # try `pip install vllm` (would either fail or grab a non-XPU wheel).
    _intel_after="${backend_dir}/requirements-intel-after.txt"
    _intel_after_bak=""
    if [ -f "${_intel_after}" ]; then
        _intel_after_bak="${_intel_after}.xpu.bak"
        mv "${_intel_after}" "${_intel_after_bak}"
    fi
    installRequirements
    if [ -n "${_intel_after_bak}" ]; then
        mv "${_intel_after_bak}" "${_intel_after}"
    fi

    # vllm's CMake build needs the Intel oneAPI dpcpp/sycl compiler — the
    # base image (intel/oneapi-basekit) has it but the env isn't sourced.
    if [ -f /opt/intel/oneapi/setvars.sh ]; then
        set +u
        source /opt/intel/oneapi/setvars.sh --force
        set -u
    fi

    _vllm_src=$(mktemp -d)
    trap 'rm -rf "${_vllm_src}"' EXIT
    git clone --depth 1 https://github.com/vllm-project/vllm "${_vllm_src}/vllm"
    pushd "${_vllm_src}/vllm"
    # Install vllm's own runtime deps (torch-xpu, vllm_xpu_kernels,
    # pydantic, fastapi, …) from upstream's requirements/xpu.txt — the
    # canonical source of truth. Avoids re-pinning everything ourselves.
    uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} -r requirements/xpu.txt
    # Stock triton (NVIDIA-only) may have come in transitively; replace
    # with triton-xpu==3.7.0 which matches torch 2.11.
    uv pip uninstall triton triton-xpu 2>/dev/null || true
    uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} \
        --extra-index-url https://download.pytorch.org/whl/xpu \
        triton-xpu==3.7.0
    export CMAKE_PREFIX_PATH="$(python -c 'import site; print(site.getsitepackages()[0])'):${CMAKE_PREFIX_PATH:-}"
    VLLM_TARGET_DEVICE=xpu uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} --no-deps .
    popd
# L4T arm64 (JetPack 7): drive the install through pyproject.toml so that
# [tool.uv.sources] can pin torch/vllm/flash-attn/torchvision/torchaudio
# to the jetson-ai-lab index, while everything else (transitive deps and
# PyPI-resolvable packages like transformers) comes from PyPI. Bypasses
# installRequirements because uv pip install -r requirements.txt does not
# honor sources — see backend/python/vllm/pyproject.toml for the rationale.
elif [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
    ensureVenv
    if [ "x${PORTABLE_PYTHON}" == "xtrue" ]; then
        export C_INCLUDE_PATH="${C_INCLUDE_PATH:-}:$(_portable_dir)/include/python${PYTHON_VERSION}"
    fi
    pushd "${backend_dir}"
    # Build deps first (matches installRequirements' requirements-install.txt
    # pass — fastsafetensors and friends need pybind11 in the venv before
    # their sdists can build under --no-build-isolation).
    uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} -r requirements-install.txt
    uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} --requirement pyproject.toml
    popd
    runProtogen
# FROM_SOURCE=true on a CPU build skips the prebuilt vllm wheel in
# requirements-cpu-after.txt and compiles vllm locally against the host's
# actual CPU. Not used by default because it takes ~30-40 minutes, but
# kept here for hosts where the prebuilt wheel SIGILLs (CPU without the
# required SIMD baseline, e.g. AVX-512 VNNI/BF16). Default CI uses a
# bigger-runner with compatible hardware instead.
if [ "x${BUILD_TYPE}" == "x" ] && [ "x${FROM_SOURCE:-}" == "xtrue" ]; then
elif [ "x${BUILD_TYPE}" == "x" ] && [ "x${FROM_SOURCE:-}" == "xtrue" ]; then
    # Temporarily hide the prebuilt wheel so installRequirements doesn't
    # pull it — the rest of the requirements files (base deps, torch,
    # transformers) are still installed normally.
backend/python/vllm/package.sh

@@ -45,5 +45,109 @@ copy_with_symlinks() {
copy_with_symlinks libnuma.so.1
copy_with_symlinks libgomp.so.1

# CPU profile only: bundle a g++ toolchain so torch._inductor's
# ISA probe (always run at vllm engine startup, regardless of
# enforce_eager) finds a C++ compiler. The LocalAI runtime image
# is FROM ubuntu:24.04 with a minimal apt list that does not
# include build-essential, and the backend image itself is FROM
# scratch -- so without this, cpu-vllm crashes with
# torch._inductor.exc.InvalidCxxCompiler at first inference
# unless the operator manually sets TORCH_COMPILE_DISABLE=1.
#
# We snapshot every file owned by the toolchain packages, mirroring
# the /usr/... layout into ${BACKEND}/toolchain/ so g++ can find
# cc1plus, headers, libs etc. via GCC_EXEC_PREFIX / CPATH /
# LIBRARY_PATH at runtime (libbackend.sh wires those up). Adds
# ~400 MB to the cpu-vllm image, which is tolerable -- cpu-vllm is
# already a niche profile.
if [ "${BUILD_TYPE:-}" = "" ] && command -v dpkg-query >/dev/null 2>&1; then
    TOOLCHAIN_DIR="${CURDIR}/toolchain"
    mkdir -p "${TOOLCHAIN_DIR}"
    # The unversioned g++/gcc packages on Debian/Ubuntu only ship
    # symlinks; the actual binaries live in g++-${VER}/gcc-${VER}.
    # Discover the active version so the symlink targets get bundled
    # along with their owners.
    GCC_VER=$(gcc -dumpversion 2>/dev/null | cut -d. -f1 || true)
    # `g++-${VER}` itself is just another symlink layer on Debian/
    # Ubuntu — the real binary `x86_64-linux-gnu-g++-${VER}` lives
    # in `g++-${VER}-x86-64-linux-gnu` (a separate package pulled in
    # as a dependency). Same story for gcc/cpp. Compute the dpkg
    # arch-triplet to find the right package name for both amd64 and
    # arm64 hosts.
    case "$(dpkg --print-architecture 2>/dev/null)" in
        amd64) HOST_TRIPLET="x86-64-linux-gnu" ;;
        arm64) HOST_TRIPLET="aarch64-linux-gnu" ;;
        *) HOST_TRIPLET="" ;;
    esac
    PKGS=(g++ gcc cpp libstdc++-${GCC_VER}-dev libgcc-${GCC_VER}-dev libc6 libc6-dev binutils binutils-common libbinutils libc-dev-bin linux-libc-dev libcrypt-dev libgomp1 libstdc++6 libgcc-s1 libisl23 libmpc3 libmpfr6 libjansson4 libctf0 libctf-nobfd0 libsframe1)
    if [ -n "${GCC_VER}" ]; then
        PKGS+=("g++-${GCC_VER}" "gcc-${GCC_VER}" "cpp-${GCC_VER}" "gcc-${GCC_VER}-base")
        if [ -n "${HOST_TRIPLET}" ]; then
            PKGS+=(
                "g++-${GCC_VER}-${HOST_TRIPLET}"
                "gcc-${GCC_VER}-${HOST_TRIPLET}"
                "cpp-${GCC_VER}-${HOST_TRIPLET}"
                "binutils-${HOST_TRIPLET}"
            )
        fi
    fi
    for pkg in "${PKGS[@]}"; do
        if ! dpkg-query -W "${pkg}" >/dev/null 2>&1; then
            continue
        fi
        # Copy each owned path, preserving symlinks and mode. We
        # tolerate dpkg listing directories alongside files.
        dpkg -L "${pkg}" | while IFS= read -r path; do
            if [ -L "${path}" ] || [ -f "${path}" ]; then
                mkdir -p "${TOOLCHAIN_DIR}$(dirname "${path}")"
                cp -aP "${path}" "${TOOLCHAIN_DIR}${path}" 2>/dev/null || true
            fi
        done
    done
    # Ubuntu's filesystem layout has /lib -> /usr/lib (UsrMerge) and
    # /lib64 -> /usr/lib64. ld scripts (e.g. libm.so) hardcode
    # `/lib/x86_64-linux-gnu/libm.so.6`; with --sysroot the linker
    # looks for that path under the sysroot, which means we need
    # the same symlinks under TOOLCHAIN_DIR.
    [ -e "${TOOLCHAIN_DIR}/lib" ] || ln -s usr/lib "${TOOLCHAIN_DIR}/lib"
    [ -e "${TOOLCHAIN_DIR}/lib64" ] || ln -s usr/lib64 "${TOOLCHAIN_DIR}/lib64"

    # Replace the unversioned g++/gcc/cpp symlinks with wrapper
    # scripts that pass --sysroot=<toolchain> and -B <gcc-exec-prefix>.
    # Without these flags gcc would fall back to its compiled-in
    # /usr search and fail to find headers (the runtime image has no
    # libc6-dev) or fail to invoke `as`/`ld` (binutils not on PATH at
    # /usr/bin). Wrappers self-resolve their location at runtime so
    # they work from any BackendsPath.
    BIN_DIR="${TOOLCHAIN_DIR}/usr/bin"
    if [ -n "${GCC_VER}" ] && [ -n "${HOST_TRIPLET}" ]; then
        # HOST_TRIPLET in package names uses dashes ("x86-64-linux-gnu");
        # the binary suffix uses underscores in the arch part
        # ("x86_64-linux-gnu-g++-13"). Translate.
        BIN_TRIPLET=${HOST_TRIPLET//x86-64/x86_64}
        for tool in g++ gcc cpp; do
            real="${BIN_DIR}/${BIN_TRIPLET}-${tool}-${GCC_VER}"
            if [ -x "${real}" ]; then
                rm -f "${BIN_DIR}/${tool}" "${BIN_DIR}/${tool}-${GCC_VER}"
                cat > "${BIN_DIR}/${tool}" <<EOF
#!/bin/bash
# Auto-generated by package.sh. Passes --sysroot and -B so the
# bundled toolchain works from any BackendsPath without depending
# on libc6-dev / binutils being installed at /usr in the runtime
# image. See backend/python/vllm/package.sh.
DIR="\$(dirname "\$(readlink -f "\$0")")" # …/toolchain/usr/bin
SYSROOT="\$(dirname "\$(dirname "\${DIR}")")" # …/toolchain
exec "\${DIR}/${BIN_TRIPLET}-${tool}-${GCC_VER}" \\
    -B "\${SYSROOT}/usr/lib/gcc/${BIN_TRIPLET}/${GCC_VER}/" \\
    --sysroot="\${SYSROOT}" \\
    "\$@"
EOF
                chmod +x "${BIN_DIR}/${tool}"
            fi
        done
    fi
    echo "Bundled g++ toolchain (gcc-${GCC_VER}) into ${TOOLCHAIN_DIR} ($(du -sh "${TOOLCHAIN_DIR}" | cut -f1))"
fi

echo "vllm packaging completed successfully"
ls -liah "${LIB_DIR}/"
backend/python/vllm/pyproject.toml (new file, 61 lines)

@@ -0,0 +1,61 @@
# L4T arm64 (JetPack 7 / sbsa cu130) install spec for the vllm backend.
#
# Why this file exists, and why only the l4t13 BUILD_PROFILE consumes it:
#
# pypi.jetson-ai-lab.io hosts the L4T-specific torch / vllm / flash-attn
# wheels we need on aarch64 + cuda13, but it ALSO transparently proxies the
# rest of PyPI through `/+f/<sha>/<filename>` URLs that 503 frequently. With
# `--extra-index-url` + `--index-strategy=unsafe-best-match` (the historical
# fix in install.sh) uv would pick those proxy URLs for ordinary PyPI
# packages — `anthropic`, `openai`, `propcache`, `annotated-types` — and
# trip on the 503s. See e.g. CI run 25212201349 (anthropic-0.97.0).
#
# `explicit = true` on the index makes uv consult the L4T mirror ONLY for
# packages mapped under [tool.uv.sources]. Everything else goes to PyPI.
# This breaks the historical 503 path without losing access to the L4T
# wheels we actually need from there.
#
# `uv pip install -r requirements.txt` does NOT honor [tool.uv.sources]
# (sources are project-mode only, not pip-compat mode), so install.sh's
# l4t13 branch invokes `uv pip install --requirement pyproject.toml`
# directly. Other BUILD_PROFILEs continue to use the requirements-*.txt
# pipeline through libbackend.sh's installRequirements and never read
# this file.
[project]
name = "localai-vllm-l4t13"
version = "0.0.0"
requires-python = ">=3.12,<3.13"
dependencies = [
    # Mirror of requirements.txt — kept in sync manually for now since the
    # l4t13 path bypasses installRequirements (see install.sh).
    "grpcio==1.80.0",
    "protobuf",
    "certifi",
    "setuptools",
    "pillow",
    "charset-normalizer>=3.4.0",
    "chardet",
    # L4T-specific accelerator stack (sourced from jetson-ai-lab below).
    "torch",
    "torchvision",
    "torchaudio",
    "flash-attn",
    "vllm",
    # PyPI-resolvable packages that complete the runtime — accelerate,
    # transformers, bitsandbytes carry their own wheels for aarch64.
    "accelerate",
    "transformers",
    "bitsandbytes",
]

[[tool.uv.index]]
name = "jetson-ai-lab"
url = "https://pypi.jetson-ai-lab.io/sbsa/cu130"
explicit = true

[tool.uv.sources]
torch = { index = "jetson-ai-lab" }
torchvision = { index = "jetson-ai-lab" }
torchaudio = { index = "jetson-ai-lab" }
flash-attn = { index = "jetson-ai-lab" }
vllm = { index = "jetson-ai-lab" }
backend/python/vllm/requirements-intel-after.txt

@@ -1 +1,3 @@
vllm
# Intel XPU has no upstream-published vllm wheels — install.sh builds vllm
# from source with VLLM_TARGET_DEVICE=xpu and hides this file during
# installRequirements. Don't add a `vllm` line here.
backend/python/vllm/requirements-intel.txt

@@ -1,7 +1,8 @@
--extra-index-url https://download.pytorch.org/whl/xpu
# vllm's own deps (torch==2.11.0+xpu, vllm_xpu_kernels, pydantic, …) are
# installed from upstream's requirements/xpu.txt during the source build —
# see install.sh. Only list what LocalAI's vllm backend.py needs directly.
accelerate
torch
transformers
optimum[openvino]
bitsandbytes
setuptools
bitsandbytes
@@ -1,2 +0,0 @@
--extra-index-url https://pypi.jetson-ai-lab.io/sbsa/cu130
vllm
@@ -1,8 +0,0 @@
--extra-index-url https://pypi.jetson-ai-lab.io/sbsa/cu130
accelerate
torch
torchvision
torchaudio
transformers
bitsandbytes
flash-attn
backend/python/vllm/requirements.txt

@@ -1,4 +1,7 @@
grpcio==1.80.0
protobuf
certifi
setuptools
setuptools
pillow
charset-normalizer>=3.4.0
chardet
backend/python/vllm/run.sh

@@ -1,4 +1,5 @@
#!/bin/bash
set -x

backend_dir=$(dirname $0)

@@ -8,4 +9,41 @@ else
    source $backend_dir/../common/libbackend.sh
fi

startBackend $@
# CPU profile: torch._inductor's ISA-probe (run at vllm engine
# startup, even with enforce_eager=True) shells out to g++. The
# LocalAI runtime image and the FROM-scratch backend image both
# omit a compiler; package.sh bundles one into ${EDIR}/toolchain
# along with wrapper scripts at toolchain/usr/bin that already pass
# --sysroot and -B. So all run.sh has to do is put the wrapper on
# PATH and expose the toolchain's shared libs (libisl, libmpc, libbfd,
# ...) to ld.so. No-op for other profiles -- the dir doesn't exist.
if [ -d "${EDIR}/toolchain/usr/bin" ]; then
    export PATH="${EDIR}/toolchain/usr/bin:${PATH}"
    _libpath="${EDIR}/toolchain/usr/lib/x86_64-linux-gnu"
    export LD_LIBRARY_PATH="${_libpath}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
fi

# Multi-node DP follower mode: when the first arg is `serve`, exec into
# vllm's own CLI instead of LocalAI's backend.py gRPC server. The
# follower speaks ZMQ directly to the head node's vllm ranks — there
# is no LocalAI gRPC on the follower side. Reaches this path via
# `local-ai p2p-worker vllm`.
if [ "${1:-}" = "serve" ]; then
    ensureVenv
    if [ "x${PORTABLE_PYTHON}" == "xtrue" ] || [ -x "$(_portable_python)" ]; then
        _makeVenvPortable --update-pyvenv-cfg
    fi
    if [ -d "${EDIR}/lib" ]; then
        export LD_LIBRARY_PATH="${EDIR}/lib:${LD_LIBRARY_PATH:-}"
    fi
    # Run the vllm console script through the venv python rather than
    # exec-ing it directly. uv bakes an absolute shebang at install time
    # (e.g. `#!/vllm/venv/bin/python3` from the build image) which
    # doesn't exist once the backend is relocated to BackendsPath, and
    # _makeVenvPortable's shebang rewriter only matches paths that
    # already point at ${EDIR}. Invoking python with the script as an
    # argument bypasses the shebang entirely.
    exec "${EDIR}/venv/bin/python" "${EDIR}/venv/bin/vllm" "$@"
fi

startBackend $@
core/cli/worker/labels.go (new file, 20 lines)
@@ -0,0 +1,20 @@
package worker

import "strings"

// ParseNodeLabels parses a comma-separated `k=v,k=v` string into a map.
// Whitespace around keys, values, and pairs is trimmed; pairs without
// `=` are skipped silently.
func ParseNodeLabels(input string) map[string]string {
	labels := make(map[string]string)
	if input == "" {
		return labels
	}
	for _, pair := range strings.Split(input, ",") {
		pair = strings.TrimSpace(pair)
		if k, v, ok := strings.Cut(pair, "="); ok {
			labels[strings.TrimSpace(k)] = strings.TrimSpace(v)
		}
	}
	return labels
}
@@ -8,8 +8,9 @@ type WorkerFlags struct {
}

type Worker struct {
	P2P            P2P            `cmd:"" name:"p2p-llama-cpp-rpc" help:"Starts a LocalAI llama.cpp worker in P2P mode (requires a token)"`
	P2PMLX         P2PMLX         `cmd:"" name:"p2p-mlx" help:"Starts a LocalAI MLX distributed worker in P2P mode (requires a token)"`
	LLamaCPP       LLamaCPP       `cmd:"" name:"llama-cpp-rpc" help:"Starts a llama.cpp worker in standalone mode"`
	MLXDistributed MLXDistributed `cmd:"" name:"mlx-distributed" help:"Starts an MLX distributed worker in standalone mode (requires --hostfile and --rank)"`
	P2P             P2P             `cmd:"" name:"p2p-llama-cpp-rpc" help:"Starts a LocalAI llama.cpp worker in P2P mode (requires a token)"`
	P2PMLX          P2PMLX          `cmd:"" name:"p2p-mlx" help:"Starts a LocalAI MLX distributed worker in P2P mode (requires a token)"`
	LLamaCPP        LLamaCPP        `cmd:"" name:"llama-cpp-rpc" help:"Starts a llama.cpp worker in standalone mode"`
	MLXDistributed  MLXDistributed  `cmd:"" name:"mlx-distributed" help:"Starts an MLX distributed worker in standalone mode (requires --hostfile and --rank)"`
	VLLMDistributed VLLMDistributed `cmd:"" name:"vllm" help:"Starts a vLLM data-parallel follower process. Multi-node DP for a single model: head runs the existing vllm backend with engine_args.data_parallel_size>1, followers run this command."`
}
core/cli/worker/worker_backend_common.go (new file, 58 lines)
@@ -0,0 +1,58 @@
package worker

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"path/filepath"

	"github.com/mudler/LocalAI/core/config"
	"github.com/mudler/LocalAI/core/gallery"
	"github.com/mudler/LocalAI/pkg/model"
	"github.com/mudler/LocalAI/pkg/system"
	"github.com/mudler/xlog"
)

// findBackendPath resolves the directory containing a backend's run.sh,
// installing the backend from the gallery if it isn't present.
// `name` is the gallery entry name (for vLLM the meta entry "vllm"
// resolves to a platform-specific package via capability lookup).
func findBackendPath(name, galleries string, systemState *system.SystemState) (string, error) {
	backends, err := gallery.ListSystemBackends(systemState)
	if err != nil {
		return "", err
	}
	if backend, ok := backends.Get(name); ok {
		return runFileDir(backend.RunFile)
	}

	ml := model.NewModelLoader(systemState)
	var gals []config.Gallery
	if err := json.Unmarshal([]byte(galleries), &gals); err != nil {
		xlog.Error("failed loading galleries", "error", err)
		return "", err
	}
	if err := gallery.InstallBackendFromGallery(context.Background(), gals, systemState, ml, name, nil, true); err != nil {
		xlog.Error("backend not found, failed to install it", "name", name, "error", err)
		return "", err
	}

	backends, err = gallery.ListSystemBackends(systemState)
	if err != nil {
		return "", err
	}
	backend, ok := backends.Get(name)
	if !ok {
		return "", fmt.Errorf("%s backend not found after install", name)
	}
	return runFileDir(backend.RunFile)
}

func runFileDir(runFile string) (string, error) {
	dir := filepath.Dir(runFile)
	if dir == "" {
		return "", errors.New("backend has no run.sh, install it first")
	}
	return dir, nil
}
@@ -1,57 +1,16 @@
package worker

import (
	"context"
	"encoding/json"
	"errors"
	"os/exec"
	"path/filepath"

	"github.com/mudler/LocalAI/core/config"
	"github.com/mudler/LocalAI/core/gallery"
	"github.com/mudler/LocalAI/pkg/model"
	"github.com/mudler/LocalAI/pkg/system"
	"github.com/mudler/xlog"
)

const mlxDistributedGalleryName = "mlx-distributed"

// findMLXDistributedBackendPath finds or installs the mlx-distributed backend
// and returns the directory containing run.sh.
func findMLXDistributedBackendPath(galleries string, systemState *system.SystemState) (string, error) {
	backends, err := gallery.ListSystemBackends(systemState)
	if err != nil {
		return "", err
	}

	backend, ok := backends.Get(mlxDistributedGalleryName)
	if !ok {
		ml := model.NewModelLoader(systemState)
		var gals []config.Gallery
		if err := json.Unmarshal([]byte(galleries), &gals); err != nil {
			xlog.Error("failed loading galleries", "error", err)
			return "", err
		}
		if err := gallery.InstallBackendFromGallery(context.Background(), gals, systemState, ml, mlxDistributedGalleryName, nil, true); err != nil {
			xlog.Error("mlx-distributed backend not found, failed to install it", "error", err)
			return "", err
		}
		// Re-fetch after install
		backends, err = gallery.ListSystemBackends(systemState)
		if err != nil {
			return "", err
		}
		backend, ok = backends.Get(mlxDistributedGalleryName)
		if !ok {
			return "", errors.New("mlx-distributed backend not found after install")
		}
	}

	backendPath := filepath.Dir(backend.RunFile)
	if backendPath == "" {
		return "", errors.New("mlx-distributed backend not found, install it first")
	}
	return backendPath, nil
	return findBackendPath(mlxDistributedGalleryName, galleries, systemState)
}

// buildMLXCommand builds the exec.Cmd to launch the mlx-distributed backend.
core/cli/worker/worker_suite_test.go (new file, 13 lines)
@@ -0,0 +1,13 @@
package worker

import (
	"testing"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

func TestWorker(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "Worker Suite")
}
core/cli/worker/worker_vllm.go (new file, 221 lines)
@@ -0,0 +1,221 @@
package worker

import (
	"cmp"
	"context"
	"fmt"
	"os"
	"os/exec"
	"os/signal"
	"path/filepath"
	"strconv"
	"syscall"
	"time"

	cliContext "github.com/mudler/LocalAI/core/cli/context"
	"github.com/mudler/LocalAI/core/cli/workerregistry"
	"github.com/mudler/LocalAI/core/services/nodes"
	"github.com/mudler/LocalAI/pkg/system"
	"github.com/mudler/LocalAI/pkg/xsysinfo"
	"github.com/mudler/xlog"
)

// vLLMFollowerRoleLabel marks a node as a vLLM data-parallel follower.
// Operators scope regular models away from these nodes via inverse
// selectors like {"!node.role":"vllm-follower"}.
const vLLMFollowerRoleLabel = "vllm-follower"

// VLLMDistributed runs a vLLM follower process for multi-node
// data-parallel inference. The head runs LocalAI's existing single-
// node vLLM gRPC backend with engine_args.data_parallel_size > 1;
// followers run vanilla `vllm serve --headless ...` and speak ZMQ
// directly to the head.
//
// The follower is operator-launched (no NATS / SmartRouter placement
// in this iteration). When --register-to is set, the worker self-
// registers as an agent-type node so it shows up in the admin UI; a
// `node.role=vllm-follower` label discourages model placement on it.
type VLLMDistributed struct {
	WorkerFlags `embed:""`

	// Registration (optional). Without these the worker just runs vLLM
	// and exits — no UI visibility. With them set, the follower
	// registers as an agent-type node, heartbeats while vLLM is
	// running, and deregisters on shutdown.
	RegisterTo        string `env:"LOCALAI_REGISTER_TO" help:"Frontend URL for self-registration. Empty = no registration." group:"registration"`
	RegistrationToken string `env:"LOCALAI_REGISTRATION_TOKEN" help:"Token for authenticating with the frontend" group:"registration"`
	NodeName          string `env:"LOCALAI_NODE_NAME" help:"Node name for registration (defaults to vllm-<hostname>)" group:"registration"`
	NodeLabels        string `env:"LOCALAI_NODE_LABELS" help:"Comma-separated key=value labels for this node (node.role=vllm-follower is always added)" group:"registration"`
	HeartbeatInterval string `env:"LOCALAI_HEARTBEAT_INTERVAL" default:"10s" help:"Interval between heartbeats" group:"registration"`

	// vLLM data-parallel placement. The head must advertise the same
	// data_parallel_size / data_parallel_rpc_port via its engine_args;
	// followers use --master-addr / --master-port to find it.
	Model                 string   `arg:"" help:"HuggingFace model ID or local path (must match the head)"`
	DataParallelSize      int      `name:"data-parallel-size" env:"VLLM_DATA_PARALLEL_SIZE" required:"" help:"Total DP ranks across all nodes"`
	DataParallelSizeLocal int      `name:"data-parallel-size-local" env:"VLLM_DATA_PARALLEL_SIZE_LOCAL" required:"" help:"DP ranks on this node"`
	StartRank             int      `name:"start-rank" env:"VLLM_DATA_PARALLEL_START_RANK" required:"" help:"Starting DP rank for this node (>0 for followers)"`
	MasterAddr            string   `name:"master-addr" env:"VLLM_DP_MASTER_ADDR" required:"" help:"Head node IP/hostname for DP RPC handshake"`
	MasterPort            int      `name:"master-port" env:"VLLM_DP_MASTER_PORT" required:"" help:"Head node DP RPC port"`
	Headless              bool     `env:"VLLM_HEADLESS" default:"true" negatable:"" help:"Headless follower mode (no API server)"`
	ExtraArgs             []string `name:"vllm-arg" env:"VLLM_EXTRA_ARGS" help:"Additional CLI args passed verbatim to vllm serve (e.g. --tensor-parallel-size 2). May be repeated."`
}

func (r *VLLMDistributed) Run(ctx *cliContext.Context) error {
	// Rank 0 is the head: it must serve the OpenAI API. --headless
	// disables that, so the combination is operator error and would
	// silently produce a cluster that can't accept requests.
	if r.Headless && r.StartRank == 0 {
		return fmt.Errorf("--start-rank 0 (head) cannot be --headless; the head serves the API")
	}

	systemState, err := system.GetSystemState(
		system.WithBackendPath(r.BackendsPath),
		system.WithBackendSystemPath(r.BackendsSystemPath),
	)
	if err != nil {
		return fmt.Errorf("getting system state: %w", err)
	}

	backendPath, err := findBackendPath("vllm", r.BackendGalleries, systemState)
	if err != nil {
		return fmt.Errorf("cannot find vllm backend: %w", err)
	}

	args := r.buildVLLMArgs()
	runSh := filepath.Join(backendPath, "run.sh")

	shutdownCtx, shutdownCancel := context.WithCancel(context.Background())
	defer shutdownCancel()

	// Self-register so the follower is visible in the admin UI. Done
	// before vLLM starts so an unreachable frontend fails fast rather
	// than after the GPU is already loaded.
	if r.RegisterTo != "" {
		regClient := &workerregistry.RegistrationClient{
			FrontendURL:       r.RegisterTo,
			RegistrationToken: r.RegistrationToken,
		}
		nodeID, _, regErr := regClient.RegisterWithRetry(context.Background(), r.registrationBody(), 10)
		if regErr != nil {
			return fmt.Errorf("registering with frontend: %w", regErr)
		}
		xlog.Info("Registered with frontend", "nodeID", nodeID, "frontend", r.RegisterTo, "role", "vllm-follower")

		heartbeatInterval, _ := time.ParseDuration(r.HeartbeatInterval)
		heartbeatInterval = cmp.Or(heartbeatInterval, 10*time.Second)
		go regClient.HeartbeatLoop(shutdownCtx, nodeID, heartbeatInterval, r.heartbeatBody)

		defer regClient.GracefulDeregister(nodeID)
	}

	xlog.Info("Starting vllm follower",
		"model", r.Model,
		"data-parallel-size", r.DataParallelSize,
		"data-parallel-size-local", r.DataParallelSizeLocal,
		"start-rank", r.StartRank,
		"master", fmt.Sprintf("%s:%d", r.MasterAddr, r.MasterPort),
	)

	cmd := exec.CommandContext(shutdownCtx, runSh, args...)
	// VLLM_DP_* env vars are belt-and-braces alongside the explicit
	// CLI flags — vLLM honours both (vllm/envs.py:142-148).
	cmd.Env = append(os.Environ(),
		fmt.Sprintf("VLLM_DP_MASTER_IP=%s", r.MasterAddr),
		fmt.Sprintf("VLLM_DP_MASTER_PORT=%d", r.MasterPort),
		fmt.Sprintf("VLLM_DP_SIZE=%d", r.DataParallelSize),
		fmt.Sprintf("VLLM_DP_RANK=%d", r.StartRank),
		"VLLM_DP_RANK_LOCAL=0",
	)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.Stdin = os.Stdin

	// Forward INT/TERM to vLLM so it gets a chance to clean up its ZMQ
	// sockets. exec.CommandContext kills with SIGKILL on cancellation,
	// which we want as a fallback only.
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)
	defer signal.Stop(sigCh)

	if err := cmd.Start(); err != nil {
		return fmt.Errorf("starting vllm: %w", err)
	}

	waitErr := make(chan error, 1)
	go func() { waitErr <- cmd.Wait() }()

	for {
		select {
		case sig := <-sigCh:
			xlog.Info("Forwarding signal to vllm", "signal", sig)
			if cmd.Process != nil {
				_ = cmd.Process.Signal(sig)
			}
		case err := <-waitErr:
			return err
		}
	}
}

// buildVLLMArgs assembles the vLLM CLI argv. Factored out for unit
// testing — Run is hard to test without a real backend install.
func (r *VLLMDistributed) buildVLLMArgs() []string {
	args := []string{"serve", r.Model}
	if r.Headless {
		args = append(args, "--headless")
	}
	args = append(args,
		"--data-parallel-size", strconv.Itoa(r.DataParallelSize),
		"--data-parallel-size-local", strconv.Itoa(r.DataParallelSizeLocal),
		"--data-parallel-start-rank", strconv.Itoa(r.StartRank),
		"--data-parallel-address", r.MasterAddr,
		"--data-parallel-rpc-port", strconv.Itoa(r.MasterPort),
	)
	args = append(args, r.ExtraArgs...)
	return args
}

// registrationBody mirrors agent_worker.go's shape: agent-type nodes
// don't need an address, which fits a follower that doesn't host any
// LocalAI gRPC backends. The node.role label lets operators scope
// regular model placement away from followers.
func (r *VLLMDistributed) registrationBody() map[string]any {
	nodeName := r.NodeName
	if nodeName == "" {
		hostname, err := os.Hostname()
		if err != nil {
			nodeName = fmt.Sprintf("vllm-follower-%d", os.Getpid())
		} else {
			nodeName = "vllm-" + hostname
		}
	}

	totalVRAM, _ := xsysinfo.TotalAvailableVRAM()
	gpuVendor, _ := xsysinfo.DetectGPUVendor()

	body := map[string]any{
		"name":           nodeName,
		"node_type":      nodes.NodeTypeAgent,
		"total_vram":     totalVRAM,
		"available_vram": totalVRAM,
		"gpu_vendor":     gpuVendor,
	}
	if r.RegistrationToken != "" {
		body["token"] = r.RegistrationToken
	}

	labels := ParseNodeLabels(r.NodeLabels)
	labels["node.role"] = vLLMFollowerRoleLabel
	body["labels"] = labels
	return body
}

func (r *VLLMDistributed) heartbeatBody() map[string]any {
	body := map[string]any{}
	aggregate := xsysinfo.GetGPUAggregateInfo()
	if aggregate.TotalVRAM > 0 {
		body["available_vram"] = aggregate.FreeVRAM
	}
	return body
}
core/cli/worker/worker_vllm_test.go (new file, 105 lines)
@@ -0,0 +1,105 @@
package worker

import (
	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

var _ = Describe("VLLMDistributed", func() {
	Describe("buildVLLMArgs", func() {
		DescribeTable("produces the expected vLLM CLI argv",
			func(cmd VLLMDistributed, want []string) {
				Expect(cmd.buildVLLMArgs()).To(Equal(want))
			},
			Entry("headless follower with explicit master",
				VLLMDistributed{
					Model:                 "Qwen/Qwen3.5-1.5B",
					DataParallelSize:      4,
					DataParallelSizeLocal: 2,
					StartRank:             2,
					MasterAddr:            "10.0.0.1",
					MasterPort:            32100,
					Headless:              true,
				},
				[]string{
					"serve", "Qwen/Qwen3.5-1.5B",
					"--headless",
					"--data-parallel-size", "4",
					"--data-parallel-size-local", "2",
					"--data-parallel-start-rank", "2",
					"--data-parallel-address", "10.0.0.1",
					"--data-parallel-rpc-port", "32100",
				},
			),
			Entry("head-style invocation: rank 0, not headless",
				VLLMDistributed{
					Model:                 "moonshotai/Kimi-K2.6-Instruct",
					DataParallelSize:      8,
					DataParallelSizeLocal: 4,
					StartRank:             0,
					MasterAddr:            "127.0.0.1",
					MasterPort:            32100,
					Headless:              false,
				},
				[]string{
					"serve", "moonshotai/Kimi-K2.6-Instruct",
					"--data-parallel-size", "8",
					"--data-parallel-size-local", "4",
					"--data-parallel-start-rank", "0",
					"--data-parallel-address", "127.0.0.1",
					"--data-parallel-rpc-port", "32100",
				},
			),
			Entry("extra args appended verbatim",
				VLLMDistributed{
					Model:                 "Qwen/Qwen3.5-1.5B",
					DataParallelSize:      2,
					DataParallelSizeLocal: 1,
					StartRank:             1,
					MasterAddr:            "head.local",
					MasterPort:            32100,
					Headless:              true,
					ExtraArgs:             []string{"--tensor-parallel-size", "2", "--enable-expert-parallel"},
				},
				[]string{
					"serve", "Qwen/Qwen3.5-1.5B",
					"--headless",
					"--data-parallel-size", "2",
					"--data-parallel-size-local", "1",
					"--data-parallel-start-rank", "1",
					"--data-parallel-address", "head.local",
					"--data-parallel-rpc-port", "32100",
					"--tensor-parallel-size", "2",
					"--enable-expert-parallel",
				},
			),
		)
	})

	Describe("registrationBody", func() {
		// Followers don't host LocalAI gRPC, so node_type must be "agent"
		// to bypass the address requirement on /api/node/register, and the
		// node.role label is the contract operators rely on to scope normal
		// model placement away from these nodes.
		It("registers as agent-type with the vllm-follower role label", func() {
			cmd := VLLMDistributed{
				NodeName:              "test-follower",
				DataParallelSize:      4,
				DataParallelSizeLocal: 2,
				StartRank:             2,
				MasterAddr:            "10.0.0.1",
				NodeLabels:            "tier=fast,gpu.vendor=nvidia",
			}
			body := cmd.registrationBody()

			Expect(body).To(HaveKeyWithValue("node_type", "agent"))
			Expect(body).To(HaveKeyWithValue("name", "test-follower"))

			labels, ok := body["labels"].(map[string]string)
			Expect(ok).To(BeTrue(), "labels must be map[string]string")
			Expect(labels).To(HaveKeyWithValue("node.role", "vllm-follower"))
			Expect(labels).To(HaveKeyWithValue("tier", "fast"))
			Expect(labels).To(HaveKeyWithValue("gpu.vendor", "nvidia"))
		})
	})
})
@@ -305,6 +305,97 @@ MCP servers configured in model configs work in distributed mode. The frontend r
|
||||
- **MCP tool execution** (during `/v1/chat/completions`): tool calls are routed to agent workers via NATS request-reply
|
||||
- **MCP CI jobs**: executed entirely on agent workers with access to docker for stdio-based MCP servers
|
||||
|
||||
## vLLM Multi-Node (Data-Parallel)
|
||||
|
||||
A single vLLM model can span multiple GPU nodes via data parallelism: the head node serves the OpenAI API and runs the local DP ranks, follower nodes run vanilla `vllm serve --headless` and speak ZMQ directly to the head. LocalAI's role is starting the follower processes and surfacing them in the admin UI; the cross-rank tensor traffic is vLLM's own.
|
||||
|
||||
This mode is **operator-launched** — the head config and each follower's invocation must agree on the topology (`data_parallel_size`, `data_parallel_size_local`, `data_parallel_address`, `data_parallel_rpc_port`). The SmartRouter does not place follower ranks automatically.
|
||||
|
||||
### Head node configuration
|
||||
|
||||
The head runs the existing single-node vLLM gRPC backend. Set `engine_args` to publish the DP topology vLLM expects:
|
||||
|
||||
```yaml
|
||||
backend: vllm
|
||||
parameters:
|
||||
model: moonshotai/Kimi-K2.6-Instruct
|
||||
engine_args:
|
||||
data_parallel_size: 4 # total ranks across all nodes
|
||||
data_parallel_size_local: 2 # ranks on the head node
|
||||
data_parallel_address: 10.0.0.1 # head's reachable IP
|
||||
data_parallel_rpc_port: 32100 # any free port; followers connect here
|
||||
enable_expert_parallel: true # for MoE models
|
||||
```
|
||||
|
||||
The head will start its 2 local ranks, listen on `10.0.0.1:32100`, and wait for the remaining 2 ranks to handshake.
|
||||
|
||||
### Follower nodes
|
||||
|
||||
Each follower runs `local-ai p2p-worker vllm` with matching topology, an explicit start rank, and the head's address:
|
||||
|
||||
```bash
|
||||
local-ai p2p-worker vllm \
|
||||
moonshotai/Kimi-K2.6-Instruct \
|
||||
--data-parallel-size 4 \
|
||||
--data-parallel-size-local 2 \
|
||||
--start-rank 2 \
|
||||
--master-addr 10.0.0.1 \
|
||||
--master-port 32100 \
|
||||
--register-to http://frontend:8080 \
|
||||
--registration-token changeme
|
||||
```
|
||||
|
||||
`--register-to` is optional but recommended — it makes the follower visible in the admin UI as an `agent`-type node tagged with `node.role=vllm-follower`. Without it the worker just runs vLLM and exits silently when vLLM does. The role label discourages SmartRouter from placing other models on the follower; pair it with model selectors like `{"!node.role":"vllm-follower"}` if you also run regular LocalAI models on the same fleet.
|
||||
|
||||
### Worked example: 2-node Kimi-K2.6 deployment
|
||||
|
||||
Two A100 nodes (`10.0.0.1`, `10.0.0.2`), 8 GPUs total, `data_parallel_size=8` with 4 ranks per node:
|
||||
|
||||
```yaml
|
||||
# /models/kimi.yaml on the head (10.0.0.1)
|
||||
name: kimi-k2-6
|
||||
backend: vllm
|
||||
parameters:
|
||||
model: moonshotai/Kimi-K2.6-Instruct
|
||||
engine_args:
|
||||
data_parallel_size: 8
|
||||
data_parallel_size_local: 4
|
||||
data_parallel_address: 10.0.0.1
|
||||
data_parallel_rpc_port: 32100
|
||||
enable_expert_parallel: true
|
||||
all2all_backend: deepep_high_throughput
|
||||
```
|
||||
|
||||
```bash
|
||||
# On 10.0.0.2 (follower)
|
||||
local-ai p2p-worker vllm moonshotai/Kimi-K2.6-Instruct \
|
||||
--data-parallel-size 8 --data-parallel-size-local 4 --start-rank 4 \
|
||||
--master-addr 10.0.0.1 --master-port 32100 \
|
||||
--register-to http://10.0.0.1:8080 --registration-token changeme
|
||||
```
|
||||
|
||||
A `curl http://10.0.0.1:8080/v1/chat/completions ...` against the head will then dispatch across all 8 ranks.
### Intel Arc / XPU notes

vLLM XPU supports DP (`vllm/platforms/xpu.py:198` handles `world_size_across_dp > 1`; ranks bind to `xpu:{local_rank}` in `xpu_worker.py:62`, with xccl as the collective backend). Each rank still needs a distinct discrete GPU — the iGPU on a hybrid host is not a viable second device.

Older XE-HPG GPUs (e.g. Arc A770) need to bypass the cutlass attention path:
```yaml
engine_args:
  attention_backend: TRITON_ATTN
```
`docker-compose.vllm-multinode.intel.yaml` at the repo root is the Intel equivalent of `docker-compose.vllm-multinode.yaml` — it uses `/dev/dri` passthrough, `ZE_AFFINITY_MASK` to pin each rank to one device, and the `latest-gpu-intel` images. Run it via `./tests/e2e/vllm-multinode/smoke.sh --intel`.
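
For orientation, a stripped-down follower service in that compose style; this is a sketch with illustrative values (mirroring the two-rank smoke topology), not an excerpt from the actual file:

```yaml
services:
  vllm-follower:
    image: localai/localai:latest-gpu-intel
    devices:
      - /dev/dri                 # Intel GPU passthrough
    environment:
      ZE_AFFINITY_MASK: "1"      # pin this rank to the second XPU device
    command: >
      p2p-worker vllm Qwen/Qwen2.5-0.5B-Instruct
      --data-parallel-size 2 --data-parallel-size-local 1
      --start-rank 1 --master-addr localai-head --master-port 32100
```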
### Caveats

- **Tensor parallel within a node only.** vLLM v1 does not support TP across nodes; combine `tensor_parallel_size` (within a node, via `engine_args`) with `data_parallel_size` (across nodes), as sketched after this list.
- **Followers don't host LocalAI gRPC.** The follower process is vanilla vLLM, so `/api/backend-logs/<modelId>` does not stream follower output. Use `journalctl` / `kubectl logs` / compose logs for the follower's stderr.
- **Network reachability.** The head's `data_parallel_rpc_port` plus a range of ZMQ ports (typically `data_parallel_rpc_port..+N`) must be reachable from every follower. Open them in your firewall / security group.
- **Topology must match exactly.** A mismatch in `--data-parallel-size` between head and any follower will hang the handshake. Check the head's vLLM logs for `waiting for N DP ranks` if startup stalls.
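
A sketch of that TP-within-a-node, DP-across-nodes combination, with illustrative numbers for two 8-GPU nodes (each DP rank is a 4-GPU TP group; not a tested configuration):

```yaml
engine_args:
  tensor_parallel_size: 4       # each rank spans 4 GPUs within one node
  data_parallel_size: 4         # 4 ranks x 4 GPUs = 16 GPUs over 2 nodes
  data_parallel_size_local: 2   # ranks 0-1 (8 GPUs) on the head
  data_parallel_address: 10.0.0.1
  data_parallel_rpc_port: 32100
```

Followers would then pass `--start-rank 2 --data-parallel-size-local 2`, presumably forwarding the TP size via the `--vllm-arg` passthrough shown in the smoke test below.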
## Scaling
**Adding worker capacity:** Start additional `worker` instances pointing to the same frontend. They self-register automatically:
@@ -705,6 +705,22 @@ specific target models; pick the one that matches your target. The
drafter loads in its native precision regardless of the target's
`quantization:` setting.

Another example — picking a non-default attention backend (e.g. on
hardware where the default cutlass kernels aren't supported):
```yaml
engine_args:
  attention_backend: TRITON_ATTN
```
#### Multi-node data parallelism

`engine_args.data_parallel_size > 1` combined with the
`local-ai p2p-worker vllm` follower lets a single model span multiple
GPU nodes. See [vLLM Multi-Node (Data-Parallel)]({{% relref
"/docs/features/distributed-mode#vllm-multi-node-data-parallel" %}})
for the head/follower configuration and a worked Kimi-K2.6 example.
### Transformers

[Transformers](https://huggingface.co/docs/transformers/index) is a State-of-the-art Machine Learning library for PyTorch, TensorFlow, and JAX.
tests/e2e/distributed/vllm_multinode_test.go (new file, 245 lines)
@@ -0,0 +1,245 @@
package distributed_test

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
	"runtime"
	"time"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"

	"github.com/testcontainers/testcontainers-go"
	"github.com/testcontainers/testcontainers-go/network"
	"github.com/testcontainers/testcontainers-go/wait"
)
// vLLM data-parallel deployment config served by the head. KV cache is
// trimmed because the CPU smoke runs two engines on one box and the
// prebuilt wheel auto-sizes KV to fill RAM otherwise.
const qwenDPYAML = `name: qwen-dp
backend: vllm
parameters:
  model: Qwen/Qwen2.5-0.5B-Instruct
context_size: 512
trust_remote_code: true
template:
  use_tokenizer_template: true
engine_args:
  data_parallel_size: 2
  data_parallel_size_local: 1
  data_parallel_address: localai-head
  data_parallel_rpc_port: 32100
  enforce_eager: true
  max_model_len: 512
`
// End-to-end smoke for `local-ai p2p-worker vllm`. Two containers from
// the locally-built `local-ai:tests` image — head + headless follower
// — share a docker network and a backend bind-mount (so the cpu-vllm
// backend extracted by `make extract-backend-vllm` is seen as a system
// backend, no gallery fetch). DP=2 on a 0.5B model on CPU; the test
// asserts /readyz comes up across both ranks and a chat completion
// returns non-empty content.
//
// Required preconditions (the `test-e2e-vllm-multinode` Make target
// sets these up):
// - `local-ai:tests` image built (docker-build-e2e)
// - `local-backends/vllm/` populated (extract-backend-vllm)
// - LOCALAI_VLLM_BACKEND_DIR env var pointing at the extracted dir
var _ = Describe("vLLM multi-node DP on CPU", Ordered, Label("Distributed", "VLLMMultinode"), func() {
	var baseURL string

	BeforeAll(func() {
		ctx := context.Background()
		image := vllmEnvOrDefault("LOCALAI_IMAGE", "local-ai")
		tag := vllmEnvOrDefault("LOCALAI_IMAGE_TAG", "tests")
		imageRef := fmt.Sprintf("%s:%s", image, tag)

		// LOCALAI_VLLM_BACKEND_DIR is set by the dedicated
		// `make test-e2e-vllm-multinode` target. The general
		// `make test-e2e` target picks this file up too via
		// `ginkgo -r ./tests/e2e`; in that context skip rather
		// than fail.
		backendDir := os.Getenv("LOCALAI_VLLM_BACKEND_DIR")
		if backendDir == "" {
			Skip("LOCALAI_VLLM_BACKEND_DIR not set — run `make test-e2e-vllm-multinode`")
		}
		Expect(filepath.Join(backendDir, "run.sh")).To(BeAnExistingFile(),
			"extracted backend missing run.sh — check the extract-backend-vllm output")

		// State dir for the head: holds qwen-dp.yaml and is also where
		// LocalAI redirects HF_HOME for backend subprocesses
		// (pkg/model/initializers.go:76), so Qwen weights accumulate
		// here. Stable gitignored path under local-backends/ so the
		// container's root-owned writes don't trip Ginkgo's TempDir
		// cleanup, and successive runs reuse the ~1 GB download.
		configDir := filepath.Join(thisFileDir(), "..", "..", "..", "local-backends", "vllm-multinode-state")
		Expect(os.MkdirAll(configDir, 0o755)).To(Succeed())
		Expect(os.WriteFile(filepath.Join(configDir, "qwen-dp.yaml"), []byte(qwenDPYAML), 0o644)).To(Succeed())

		net, err := network.New(ctx)
		Expect(err).ToNot(HaveOccurred())
		DeferCleanup(func() {
			_ = net.Remove(context.Background())
		})
		commonMounts := testcontainers.ContainerMounts{
			{
				Source: testcontainers.DockerBindMountSource{HostPath: backendDir},
				Target: "/var/lib/local-ai/backends/vllm",
			},
		}

		// Head: rank 0, serves the OpenAI API. We wait briefly for the
		// HTTP port to bind (so MappedPort returns), then poll /readyz
		// with a long budget for the model load + DP handshake.
		head, err := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
			ContainerRequest: testcontainers.ContainerRequest{
				Image:        imageRef,
				ExposedPorts: []string{"8080/tcp"},
				Cmd:          []string{"run", "/models/qwen-dp.yaml"},
				Env: map[string]string{
					"LOCALAI_ADDRESS": "0.0.0.0:8080",
					// Cap KV cache per rank so two CPU engines fit on
					// one host. The prebuilt wheel auto-sizes from
					// available RAM otherwise and OOM-kills with two
					// ranks sharing a 32 GB box.
					"VLLM_CPU_KVCACHE_SPACE": "1",
					// The backend dir is bind-mounted from the host;
					// without this, Python writes .pyc files into
					// __pycache__ as root and `rm -rf local-backends/`
					// fails on the next `make extract-backend-vllm`.
					"PYTHONDONTWRITEBYTECODE": "1",
				},
				Networks:       []string{net.Name},
				NetworkAliases: map[string][]string{net.Name: {"localai-head"}},
				Mounts: append(commonMounts,
					testcontainers.ContainerMount{
						// Not read-only: LocalAI writes back auto-
						// detected hooks (parser defaults, ...) into
						// the config and HF cache files into this
						// dir.
						Source: testcontainers.DockerBindMountSource{HostPath: configDir},
						Target: "/models",
					}),
				LogConsumerCfg: &testcontainers.LogConsumerConfig{
					Consumers: []testcontainers.LogConsumer{&vllmLogConsumer{prefix: "head"}},
				},
				WaitingFor: wait.ForListeningPort("8080/tcp").WithStartupTimeout(2 * time.Minute),
			},
			Started: true,
		})
		Expect(err).ToNot(HaveOccurred())
		DeferCleanup(func() {
			_ = head.Terminate(context.Background())
		})
		// Follower: rank 1, headless. Speaks ZMQ directly to the head
		// rank — no LocalAI gRPC; `p2p-worker vllm` exec's vllm serve.
		follower, err := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
			ContainerRequest: testcontainers.ContainerRequest{
				Image: imageRef,
				Cmd: []string{
					"p2p-worker", "vllm", "Qwen/Qwen2.5-0.5B-Instruct",
					"--data-parallel-size=2",
					"--data-parallel-size-local=1",
					"--start-rank=1",
					"--master-addr=localai-head",
					"--master-port=32100",
					// Mirror max_model_len from qwen-dp.yaml so both
					// ranks agree on the KV cache shape.
					"--vllm-arg=--max-model-len=512",
				},
				Env: map[string]string{
					"VLLM_CPU_KVCACHE_SPACE":  "1",
					"PYTHONDONTWRITEBYTECODE": "1",
				},
				Networks: []string{net.Name},
				Mounts:   commonMounts,
				LogConsumerCfg: &testcontainers.LogConsumerConfig{
					Consumers: []testcontainers.LogConsumer{&vllmLogConsumer{prefix: "follower"}},
				},
			},
			Started: true,
		})
		Expect(err).ToNot(HaveOccurred())
		DeferCleanup(func() {
			_ = follower.Terminate(context.Background())
		})
		port, err := head.MappedPort(ctx, "8080/tcp")
		Expect(err).ToNot(HaveOccurred())
		baseURL = fmt.Sprintf("http://localhost:%s", port.Port())

		Eventually(func() (int, error) {
			resp, err := http.Get(baseURL + "/readyz")
			if err != nil {
				return 0, err
			}
			defer func() { _ = resp.Body.Close() }()
			return resp.StatusCode, nil
		}, "20m", "10s").Should(Equal(http.StatusOK), "head /readyz never went green — both ranks need to load the model and complete the ZMQ handshake")
	})
It("serves a chat completion across both ranks", func() {
|
||||
body, err := json.Marshal(map[string]any{
|
||||
"model": "qwen-dp",
|
||||
"messages": []map[string]string{
|
||||
{"role": "user", "content": "Reply with the single word: pong."},
|
||||
},
|
||||
"max_tokens": 16,
|
||||
"temperature": 0,
|
||||
})
|
||||
Expect(err).ToNot(HaveOccurred())
|
||||
|
||||
resp, err := http.Post(baseURL+"/v1/chat/completions", "application/json", bytes.NewReader(body))
|
||||
Expect(err).ToNot(HaveOccurred())
|
||||
defer func() { _ = resp.Body.Close() }()
|
||||
|
||||
raw, err := io.ReadAll(resp.Body)
|
||||
Expect(err).ToNot(HaveOccurred())
|
||||
Expect(resp.StatusCode).To(Equal(http.StatusOK), "non-200 from chat/completions: %s", string(raw))
|
||||
|
||||
var parsed struct {
|
||||
Choices []struct {
|
||||
Message struct {
|
||||
Content string `json:"content"`
|
||||
} `json:"message"`
|
||||
} `json:"choices"`
|
||||
}
|
||||
Expect(json.Unmarshal(raw, &parsed)).To(Succeed())
|
||||
Expect(parsed.Choices).ToNot(BeEmpty())
|
||||
Expect(parsed.Choices[0].Message.Content).ToNot(BeEmpty())
|
||||
})
|
||||
})
|
||||
|
||||
type vllmLogConsumer struct {
	prefix string
}

func (l *vllmLogConsumer) Accept(log testcontainers.Log) {
	_, _ = GinkgoWriter.Write([]byte("[" + l.prefix + "] " + string(log.Content)))
}

func vllmEnvOrDefault(key, def string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return def
}

// thisFileDir returns the directory of this test file so the test can
// be run from any working directory (`go test ./...` from the repo
// root is the common case).
func thisFileDir() string {
	_, file, _, _ := runtime.Caller(0)
	return filepath.Dir(file)
}