mirror of https://github.com/mudler/LocalAI.git (synced 2026-05-16 20:52:08 -04:00)
feat(vllm, distributed): tensor parallel distributed workers (#9612)
* feat(vllm): build vllm from source for Intel XPU
Upstream publishes no XPU wheels for vllm. The Intel profile was
silently picking up a non-XPU wheel that imported but errored at
engine init, and several runtime deps (pillow, charset-normalizer,
chardet) were missing on Intel -- backend.py crashed at import time
before the gRPC server came up.
Switch the Intel profile to upstream's documented from-source
procedure (docs/getting_started/installation/gpu.xpu.inc.md in
vllm-project/vllm):
- Bump portable Python to 3.12 -- vllm-xpu-kernels ships only a
cp312 wheel.
- Source /opt/intel/oneapi/setvars.sh so vllm's CMake build sees
the dpcpp/sycl compiler from the oneapi-basekit base image.
- Hide requirements-intel-after.txt during installRequirements
(it used to 'pip install vllm'); install vllm's deps from a
fresh git clone of vllm via 'uv pip install -r
requirements/xpu.txt', swap stock triton for
triton-xpu==3.7.0, then 'VLLM_TARGET_DEVICE=xpu uv pip install
--no-deps .'.
- requirements-intel.txt trimmed to LocalAI's direct deps
(accelerate / transformers / bitsandbytes); torch-xpu, vllm,
vllm_xpu_kernels and the rest come from upstream's xpu.txt
during the source build.
- requirements.txt: add pillow + charset-normalizer + chardet --
used by backend.py and missing on the Intel install profile.
- run.sh: 'set -x' so backend startup is visible in container
logs (the gRPC startup error path was previously opaque).
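Condensed, the Intel branch of install.sh amounts to the following
(a sketch; the committed script adds the requirements-intel-after.txt
hide/restore dance and error handling around each step):

```bash
# Sketch of the XPU from-source build -- see install.sh for the real thing.
source /opt/intel/oneapi/setvars.sh --force        # dpcpp/sycl for vllm's CMake
git clone --depth 1 https://github.com/vllm-project/vllm && cd vllm
uv pip install -r requirements/xpu.txt             # torch-xpu, vllm_xpu_kernels, ...
uv pip uninstall triton triton-xpu || true         # drop the NVIDIA-only triton
uv pip install --extra-index-url https://download.pytorch.org/whl/xpu triton-xpu==3.7.0
VLLM_TARGET_DEVICE=xpu uv pip install --no-deps .  # build vllm itself
```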
Also adds a one-line docs example for engine_args.attention_backend
under the vLLM section, since older XE-HPG GPUs (e.g. Arc A770)
need TRITON_ATTN to bypass the cutlass path in vllm_xpu_kernels.
Tested end-to-end on an Intel Arc A770 with Qwen2.5-0.5B-Instruct
via LocalAI's /v1/chat/completions.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(vllm): add multi-node data-parallel follower worker
vLLM v1's multi-node story is one process per node sharing a DP
coordinator over ZMQ -- the head runs the API server with
data_parallel_size > 1 and followers run `vllm serve --headless ...`
with matching topology. Today LocalAI can already configure DP on the
head via the engine_args YAML map, but there's no way to bring up the
follower nodes -- so the head sits waiting for ranks that never
handshake.
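For reference, the vanilla two-node vLLM invocation this maps onto looks
roughly like the following (flag names per vLLM v1's serve CLI, as also
exercised by buildVLLMArgs below; addresses illustrative):

```bash
# Head (rank 0): serves the API and one local DP rank.
vllm serve Qwen/Qwen2.5-0.5B-Instruct \
  --data-parallel-size 2 --data-parallel-size-local 1 \
  --data-parallel-address 10.0.0.1 --data-parallel-rpc-port 32100

# Follower: no API server; handshakes with the head over ZMQ.
vllm serve Qwen/Qwen2.5-0.5B-Instruct --headless \
  --data-parallel-size 2 --data-parallel-size-local 1 \
  --data-parallel-start-rank 1 \
  --data-parallel-address 10.0.0.1 --data-parallel-rpc-port 32100
```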
Add `local-ai p2p-worker vllm`, mirroring MLXDistributed's structural
precedent (operator-launched, static config, no NATS placement). The
worker:
- Optionally self-registers with the frontend as an agent-type node
tagged `node.role=vllm-follower` so it's visible in the admin UI
and operators can scope ordinary models away via inverse
selectors.
- Resolves the platform-specific vllm backend via the gallery's
"vllm" meta-entry (cuda*, intel-vllm, rocm-vllm, ...).
- Runs vLLM as a child process so the heartbeat goroutine survives
until vLLM exits; forwards SIGINT/SIGTERM so vLLM can clean up its
ZMQ sockets before we tear down.
- Rejects --headless combined with --start-rank 0 (rank 0 is the
  head and must serve the API).
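A representative follower invocation (values illustrative; matches the
docs added in this change):

```bash
local-ai p2p-worker vllm Qwen/Qwen2.5-0.5B-Instruct \
  --data-parallel-size 2 --data-parallel-size-local 1 --start-rank 1 \
  --master-addr 10.0.0.1 --master-port 32100 \
  --register-to http://head:8080 --registration-token changeme
```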
Backend run.sh dispatches `serve` as the first arg to vllm's own CLI
instead of LocalAI's backend.py gRPC server -- the follower speaks
ZMQ directly to the head, there is no LocalAI gRPC on the follower
side. Single-node usage is unchanged.
Generalises the gallery resolution helper into findBackendPath()
shared by MLX and vLLM workers; extracts ParseNodeLabels for the
comma-separated label parsing both use.
Ships with two compose recipes (`docker-compose.vllm-multinode.yaml`
for NVIDIA, `docker-compose.vllm-multinode.intel.yaml` for Intel
XPU/xccl) plus `tests/e2e/vllm-multinode/smoke.sh`. Both vendors are
supported (NCCL for CUDA/ROCm, xccl for XPU) but mixed-vendor DP is
not -- PyTorch's process group requires every rank to use the same
collective backend, and NCCL/xccl/gloo don't interoperate.
Out of scope (deferred): SmartRouter-driven placement of follower
ranks via NATS backend.install events, follower log streaming through
/api/backend-logs, tensor-parallel across nodes, disaggregated
prefill via KVTransferConfig.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* test(vllm): CPU-only end-to-end test for multi-node DP
Adds tests/e2e/vllm-multinode/, a Ginkgo + testcontainers-go suite
that brings up a head + headless follower from the locally-built
local-ai:tests image, bind-mounts the cpu-vllm backend extracted by
make extract-backend-vllm so it's seen as a system backend (no gallery
fetch, no registry server), and asserts a chat completion across both
DP ranks. New `make test-e2e-vllm-multinode` target wires the docker
build, backend extract, and ginkgo run together; BuildKit caches both
images so re-runs only rebuild what changed. Tagged Label("VLLMMultinode")
so the existing distributed suite isn't pulled along.
Two pre-existing bugs surfaced by the test:
1. extract-backend-% (Makefile) failed for every backend, because all
backend images end with `FROM scratch` and `docker create` rejects
an image with no CMD/ENTRYPOINT. Fixed by passing
--entrypoint=/run.sh -- the container is never started, only
docker-cp'd, so the path doesn't have to exist; we just need
anything that satisfies the daemon's create-time validation.
2. backend/python/vllm/run.sh's `serve` shortcut for the multi-node DP
follower exec'd ${EDIR}/venv/bin/vllm directly, but uv bakes an
absolute build-time shebang (`#!/vllm/venv/bin/python3`) that no
longer resolves once the backend is relocated to BackendsPath.
_makeVenvPortable's shebang rewriter only matches paths that
already point at ${EDIR}, so the original shebang slips through
unchanged. Fixed by exec-ing ${EDIR}/venv/bin/python with the script
as an argument -- Python ignores the script's shebang in that case.
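The fix as it lands in run.sh boils down to one line:

```bash
# Python ignores the target script's shebang when invoked explicitly,
# so the stale build-time /vllm/venv/bin/python3 path is never consulted.
exec "${EDIR}/venv/bin/python" "${EDIR}/venv/bin/vllm" "$@"
```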
The test fixture caps memory aggressively (max_model_len=512,
VLLM_CPU_KVCACHE_SPACE=1, TORCH_COMPILE_DISABLE=1) so two CPU engines
fit on a 32 GB box. TORCH_COMPILE_DISABLE is currently mandatory for
cpu-vllm: torch._inductor's CPU-ISA probe runs even with
enforce_eager=True and needs g++ on PATH, which the LocalAI runtime
image doesn't ship -- to be addressed in a follow-up that bundles a
toolchain in the cpu-vllm backend.
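The caps, expressed as the environment a container or shell session
would carry (illustrative shape; the suite wires these up through
testcontainers-go, and max_model_len rides in the model's engine_args):

```bash
export VLLM_CPU_KVCACHE_SPACE=1   # GiB of KV cache per CPU engine
export TORCH_COMPILE_DISABLE=1    # mandatory until the toolchain follow-up
```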
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(vllm): bundle a g++ toolchain in the cpu-vllm backend image
torch._inductor's CPU-ISA probe (`cpu_model_runner.py:65 "Warming up
model for the compilation"`) shells out to `g++` at vllm engine
startup, regardless of `enforce_eager=True` -- the eager flag only
disables CUDA graphs, not inductor's first-batch warmup. The LocalAI
CPU runtime image (Dockerfile, unconditional apt list) does not ship
build-essential, and the cpu-vllm backend image is `FROM scratch`,
so any non-trivial inference on cpu-vllm crashes with:
torch._inductor.exc.InductorError:
InvalidCxxCompiler: No working C++ compiler found in
torch._inductor.config.cpp.cxx: (None, 'g++')
Bundling the toolchain in the CPU runtime image would bloat every
non-vllm-CPU deployment and force a single GCC version on backends
that may want clang or a different version. So this lives in the
backend, gated to BUILD_TYPE=='' (the CPU profile).
`package.sh` snapshots g++ + binutils + cc1plus + libstdc++ + libc6
(runtime + dev) + the math libs cc1plus links (libisl/libmpc/libmpfr/
libjansson) into ${BACKEND}/toolchain/, mirroring /usr/... layout. The
unversioned binaries on Debian/Ubuntu are symlink chains pointing into
multiarch packages (`g++` -> `g++-13` -> `x86_64-linux-gnu-g++-13`,
the latter in `g++-13-x86-64-linux-gnu`), so the package list resolves
both the version and the arch-triplet variant. Symlinks /lib ->
usr/lib and /lib64 -> usr/lib64 are recreated under the toolchain
root because Ubuntu's UsrMerge keeps them at /, and ld scripts
(`libc.so`, `libm.so`) hardcode `/lib/...` paths that --sysroot
re-roots into the toolchain.
The unversioned `g++`/`gcc`/`cpp` symlinks are replaced with wrapper
shell scripts that resolve their own location at runtime and pass
`--sysroot=<toolchain>` and `-B <toolchain>/usr/lib/gcc/<triplet>/<ver>/`
to the underlying versioned binary. That's how torch's bare `g++ foo.cpp
-o foo` invocation finds cc1plus (-B), system headers (--sysroot), and
the bundled libstdc++ (--sysroot again; the driver also forwards the
sysroot into the linker's library search).
`run.sh` adds the toolchain bin dir to PATH and the toolchain's
shared-lib dir to LD_LIBRARY_PATH -- everything else (header search,
linker search, executable search) is encapsulated in the wrappers.
No-op for non-CPU builds, the dir doesn't exist there.
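A quick sanity check of the bundled toolchain from inside a running
cpu-vllm container might look like this (the EDIR value is a
hypothetical default install location, not fixed by this commit):

```bash
# The wrapper resolves its own location at runtime, so it works from
# wherever the backend was relocated; -B/--sysroot are baked in.
EDIR=/backends/cpu-vllm   # assumption: adjust to your BackendsPath
printf 'int main(){return 0;}\n' > /tmp/probe.cpp
"${EDIR}/toolchain/usr/bin/g++" /tmp/probe.cpp -o /tmp/probe && /tmp/probe && echo OK
```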
The cpu-vllm image grows by ~217 MB. Tradeoff is acceptable -- cpu-vllm
is already a niche profile (few users compared to GPU vllm) and the
alternative is a backend that crashes at first inference unless the
operator manually sets TORCH_COMPILE_DISABLE=1, which silently disables
all torch.compile optimizations.
Drops `TORCH_COMPILE_DISABLE=1` from tests/e2e/vllm-multinode -- the
smoke now exercises the real compile path through the bundled toolchain.
Test runtime is +20s for the warmup compile, still <90s end to end.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* fix(vllm): scope jetson-ai-lab index to L4T-specific wheels via pyproject.toml
The L4T arm64 build resolves dependencies through pypi.jetson-ai-lab.io,
which hosts the L4T-specific torch / vllm / flash-attn wheels but also
transparently proxies the rest of PyPI through `/+f/<sha>/<filename>`
URLs. With `--extra-index-url` + `--index-strategy=unsafe-best-match`
uv would pick those proxy URLs for ordinary PyPI packages —
anthropic/openai/propcache/annotated-types — and fail when the proxy
503s. Master is hitting the same bug on its own l4t-vllm matrix entry.
Switch the l4t13 install path to a pyproject.toml that marks the
jetson-ai-lab index `explicit = true` and pins only torch, torchvision,
torchaudio, flash-attn, and vllm to it via [tool.uv.sources]. uv won't
consult the L4T mirror for anything else, so transitive deps fall back
to PyPI as the default index — no exposure to the proxy 503s.
`uv pip install -r requirements.txt` ignores [tool.uv.sources], so the
l4t13 branch in install.sh now invokes `uv pip install --requirement
pyproject.toml` directly, replacing the old requirements-l4t13*.txt
files. Other BUILD_PROFILEs continue using libbackend.sh's
installRequirements and never read pyproject.toml.
Local resolution test (x86_64, dry-run) confirms uv hits the L4T
index for torch and falls through to PyPI for everything else.
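The check was along these lines (uv supports --dry-run on pip install;
the exact invocation isn't recorded in the commit):

```bash
cd backend/python/vllm
uv pip install --dry-run --requirement pyproject.toml
# Expected: torch/torchvision/torchaudio/flash-attn/vllm resolve via
# https://pypi.jetson-ai-lab.io/sbsa/cu130, everything else via PyPI.
```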
Assisted-by: claude-code:claude-opus-4-7-1m [Read] [Edit] [Bash] [Write]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
committed by GitHub · parent 503904d311 · commit 8e43842175
Makefile (24 changed lines)
@@ -232,6 +232,20 @@ run-e2e-aio: protogen-go
	@echo 'Running e2e AIO tests'
	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e-aio

# vLLM multi-node DP smoke (CPU). Builds local-ai:tests and the
# cpu-vllm backend from the current working tree, then drives a
# head + headless follower via testcontainers-go and asserts a chat
# completion. BuildKit caches both images, so re-runs only rebuild
# what changed. The test lives under tests/e2e/distributed and is
# selected by the VLLMMultinode label so it doesn't run alongside
# the other distributed-suite tests by default.
test-e2e-vllm-multinode: docker-build-e2e extract-backend-vllm protogen-go
	@echo 'Running e2e vLLM multi-node DP test'
	LOCALAI_IMAGE=local-ai \
	LOCALAI_IMAGE_TAG=tests \
	LOCALAI_VLLM_BACKEND_DIR=$(abspath ./local-backends/vllm) \
	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter='VLLMMultinode' -v -r ./tests/e2e/distributed

########################################################
## E2E tests
########################################################

@@ -319,7 +333,7 @@ local-backends:

extract-backend-%: docker-build-% local-backends
	@echo "Extracting backend $*..."
	@CID=$$(docker create local-ai-backend:$*) && \
	@CID=$$(docker create --entrypoint=/run.sh local-ai-backend:$*) && \
	rm -rf local-backends/$* && mkdir -p local-backends/$* && \
	docker cp $$CID:/ - | tar -xf - -C local-backends/$* && \
	docker rm $$CID > /dev/null

@@ -594,6 +608,14 @@ test-extra-backend-vllm: docker-build-vllm
	BACKEND_TEST_OPTIONS=tool_parser:hermes \
	$(MAKE) test-extra-backend

## vllm multi-node data-parallel smoke test. Runs LocalAI head + a
## `local-ai p2p-worker vllm` follower in docker compose against
## Qwen2.5-0.5B with data_parallel_size=2. Requires 2 NVIDIA GPUs and
## nvidia-container-runtime on the host — vLLM v1's DP coordinator is
## not viable on CPU so this cannot run in CI without GPU.
test-extra-backend-vllm-multinode:
	./tests/e2e/vllm-multinode/smoke.sh

## tinygrad mirrors the vllm target (same model, same caps, same parser) so
## the two backends are directly comparable. The LLM path covers Predict,
## streaming and native tool-call extraction. Companion targets below cover
backend/python/vllm/install.sh

@@ -18,12 +18,15 @@ else
    source $backend_dir/../common/libbackend.sh
fi

# This is here because the Intel pip index is broken and returns 200 status codes for every package name, it just doesn't return any package links.
# This makes uv think that the package exists in the Intel pip index, and by default it stops looking at other pip indexes once it finds a match.
# We need uv to continue falling through to the pypi default index to find optimum[openvino] in the pypi index
# the --upgrade actually allows us to *downgrade* torch to the version provided in the Intel pip index
# Intel XPU: torch==2.11.0+xpu lives on the PyTorch XPU index, transitive
# deps on PyPI — unsafe-best-match lets uv mix both. vllm-xpu-kernels only
# ships a python3.12 wheel per upstream docs, so bump the portable Python
# before installRequirements (matches the l4t13 pattern below).
# https://github.com/vllm-project/vllm/blob/main/docs/getting_started/installation/gpu.xpu.inc.md
if [ "x${BUILD_PROFILE}" == "xintel" ]; then
    EXTRA_PIP_INSTALL_FLAGS+=" --upgrade --index-strategy=unsafe-first-match"
    PYTHON_VERSION="3.12"
    PYTHON_PATCH="11"
    EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi

# CPU builds need unsafe-best-match to pull torch==2.10.0+cpu from the

@@ -42,10 +45,12 @@ fi

# JetPack 7 / L4T arm64 wheels (torch, vllm, flash-attn) live on
# pypi.jetson-ai-lab.io and are built for cp312, so bump the venv Python
# accordingly. JetPack 6 keeps cp310 + USE_PIP=true. unsafe-best-match
# is required because the jetson-ai-lab index lists transitive deps at
# limited versions — without it uv pins to the first matching index and
# fails to resolve a compatible wheel from PyPI.
# accordingly. JetPack 6 keeps cp310 + USE_PIP=true.
#
# l4t13 uses pyproject.toml (see the elif branch below) to pin only the
# L4T-specific wheels to the jetson-ai-lab index via [tool.uv.sources].
# That keeps PyPI as the resolution path for transitive deps like
# anthropic/openai/propcache, which the L4T mirror's proxy 503s on.
if [ "x${BUILD_PROFILE}" == "xl4t12" ]; then
    USE_PIP=true
fi

@@ -53,16 +58,77 @@ if [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
    PYTHON_VERSION="3.12"
    PYTHON_PATCH="12"
    PY_STANDALONE_TAG="20251120"
    EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi

# Intel XPU has no upstream-published vllm wheels, so we always build vllm
# from source against torch-xpu and replace the default triton with
# triton-xpu (matching torch 2.11). Mirrors the upstream procedure:
# https://github.com/vllm-project/vllm/blob/main/docs/getting_started/installation/gpu.xpu.inc.md
if [ "x${BUILD_TYPE}" == "xintel" ]; then
    # Hide requirements-intel-after.txt so installRequirements doesn't
    # try `pip install vllm` (would either fail or grab a non-XPU wheel).
    _intel_after="${backend_dir}/requirements-intel-after.txt"
    _intel_after_bak=""
    if [ -f "${_intel_after}" ]; then
        _intel_after_bak="${_intel_after}.xpu.bak"
        mv "${_intel_after}" "${_intel_after_bak}"
    fi
    installRequirements
    if [ -n "${_intel_after_bak}" ]; then
        mv "${_intel_after_bak}" "${_intel_after}"
    fi

    # vllm's CMake build needs the Intel oneAPI dpcpp/sycl compiler — the
    # base image (intel/oneapi-basekit) has it but the env isn't sourced.
    if [ -f /opt/intel/oneapi/setvars.sh ]; then
        set +u
        source /opt/intel/oneapi/setvars.sh --force
        set -u
    fi

    _vllm_src=$(mktemp -d)
    trap 'rm -rf "${_vllm_src}"' EXIT
    git clone --depth 1 https://github.com/vllm-project/vllm "${_vllm_src}/vllm"
    pushd "${_vllm_src}/vllm"
    # Install vllm's own runtime deps (torch-xpu, vllm_xpu_kernels,
    # pydantic, fastapi, …) from upstream's requirements/xpu.txt — the
    # canonical source of truth. Avoids re-pinning everything ourselves.
    uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} -r requirements/xpu.txt
    # Stock triton (NVIDIA-only) may have come in transitively; replace
    # with triton-xpu==3.7.0 which matches torch 2.11.
    uv pip uninstall triton triton-xpu 2>/dev/null || true
    uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} \
        --extra-index-url https://download.pytorch.org/whl/xpu \
        triton-xpu==3.7.0
    export CMAKE_PREFIX_PATH="$(python -c 'import site; print(site.getsitepackages()[0])'):${CMAKE_PREFIX_PATH:-}"
    VLLM_TARGET_DEVICE=xpu uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} --no-deps .
    popd
# L4T arm64 (JetPack 7): drive the install through pyproject.toml so that
# [tool.uv.sources] can pin torch/vllm/flash-attn/torchvision/torchaudio
# to the jetson-ai-lab index, while everything else (transitive deps and
# PyPI-resolvable packages like transformers) comes from PyPI. Bypasses
# installRequirements because uv pip install -r requirements.txt does not
# honor sources — see backend/python/vllm/pyproject.toml for the rationale.
elif [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
    ensureVenv
    if [ "x${PORTABLE_PYTHON}" == "xtrue" ]; then
        export C_INCLUDE_PATH="${C_INCLUDE_PATH:-}:$(_portable_dir)/include/python${PYTHON_VERSION}"
    fi
    pushd "${backend_dir}"
    # Build deps first (matches installRequirements' requirements-install.txt
    # pass — fastsafetensors and friends need pybind11 in the venv before
    # their sdists can build under --no-build-isolation).
    uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} -r requirements-install.txt
    uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} --requirement pyproject.toml
    popd
    runProtogen
# FROM_SOURCE=true on a CPU build skips the prebuilt vllm wheel in
# requirements-cpu-after.txt and compiles vllm locally against the host's
# actual CPU. Not used by default because it takes ~30-40 minutes, but
# kept here for hosts where the prebuilt wheel SIGILLs (CPU without the
# required SIMD baseline, e.g. AVX-512 VNNI/BF16). Default CI uses a
# bigger-runner with compatible hardware instead.
if [ "x${BUILD_TYPE}" == "x" ] && [ "x${FROM_SOURCE:-}" == "xtrue" ]; then
elif [ "x${BUILD_TYPE}" == "x" ] && [ "x${FROM_SOURCE:-}" == "xtrue" ]; then
    # Temporarily hide the prebuilt wheel so installRequirements doesn't
    # pull it — the rest of the requirements files (base deps, torch,
    # transformers) are still installed normally.
backend/python/vllm/package.sh

@@ -45,5 +45,109 @@ copy_with_symlinks() {
copy_with_symlinks libnuma.so.1
copy_with_symlinks libgomp.so.1

# CPU profile only: bundle a g++ toolchain so torch._inductor's
# ISA probe (always run at vllm engine startup, regardless of
# enforce_eager) finds a C++ compiler. The LocalAI runtime image
# is FROM ubuntu:24.04 with a minimal apt list that does not
# include build-essential, and the backend image itself is FROM
# scratch -- so without this, cpu-vllm crashes with
# torch._inductor.exc.InvalidCxxCompiler at first inference
# unless the operator manually sets TORCH_COMPILE_DISABLE=1.
#
# We snapshot every file owned by the toolchain packages, mirroring
# the /usr/... layout into ${BACKEND}/toolchain/ so g++ can find
# cc1plus, headers, libs etc. via GCC_EXEC_PREFIX / CPATH /
# LIBRARY_PATH at runtime (libbackend.sh wires those up). Adds
# ~400 MB to the cpu-vllm image, which is tolerable -- cpu-vllm is
# already a niche profile.
if [ "${BUILD_TYPE:-}" = "" ] && command -v dpkg-query >/dev/null 2>&1; then
    TOOLCHAIN_DIR="${CURDIR}/toolchain"
    mkdir -p "${TOOLCHAIN_DIR}"
    # The unversioned g++/gcc packages on Debian/Ubuntu only ship
    # symlinks; the actual binaries live in g++-${VER}/gcc-${VER}.
    # Discover the active version so the symlink targets get bundled
    # along with their owners.
    GCC_VER=$(gcc -dumpversion 2>/dev/null | cut -d. -f1 || true)
    # `g++-${VER}` itself is just another symlink layer on Debian/
    # Ubuntu — the real binary `x86_64-linux-gnu-g++-${VER}` lives
    # in `g++-${VER}-x86-64-linux-gnu` (a separate package pulled in
    # as a dependency). Same story for gcc/cpp. Compute the dpkg
    # arch-triplet to find the right package name for both amd64 and
    # arm64 hosts.
    case "$(dpkg --print-architecture 2>/dev/null)" in
        amd64) HOST_TRIPLET="x86-64-linux-gnu" ;;
        arm64) HOST_TRIPLET="aarch64-linux-gnu" ;;
        *) HOST_TRIPLET="" ;;
    esac
    PKGS=(g++ gcc cpp libstdc++-${GCC_VER}-dev libgcc-${GCC_VER}-dev libc6 libc6-dev binutils binutils-common libbinutils libc-dev-bin linux-libc-dev libcrypt-dev libgomp1 libstdc++6 libgcc-s1 libisl23 libmpc3 libmpfr6 libjansson4 libctf0 libctf-nobfd0 libsframe1)
    if [ -n "${GCC_VER}" ]; then
        PKGS+=("g++-${GCC_VER}" "gcc-${GCC_VER}" "cpp-${GCC_VER}" "gcc-${GCC_VER}-base")
        if [ -n "${HOST_TRIPLET}" ]; then
            PKGS+=(
                "g++-${GCC_VER}-${HOST_TRIPLET}"
                "gcc-${GCC_VER}-${HOST_TRIPLET}"
                "cpp-${GCC_VER}-${HOST_TRIPLET}"
                "binutils-${HOST_TRIPLET}"
            )
        fi
    fi
    for pkg in "${PKGS[@]}"; do
        if ! dpkg-query -W "${pkg}" >/dev/null 2>&1; then
            continue
        fi
        # Copy each owned path, preserving symlinks and mode. We
        # tolerate dpkg listing directories alongside files.
        dpkg -L "${pkg}" | while IFS= read -r path; do
            if [ -L "${path}" ] || [ -f "${path}" ]; then
                mkdir -p "${TOOLCHAIN_DIR}$(dirname "${path}")"
                cp -aP "${path}" "${TOOLCHAIN_DIR}${path}" 2>/dev/null || true
            fi
        done
    done
    # Ubuntu's filesystem layout has /lib -> /usr/lib (UsrMerge) and
    # /lib64 -> /usr/lib64. ld scripts (e.g. libm.so) hardcode
    # `/lib/x86_64-linux-gnu/libm.so.6`; with --sysroot the linker
    # looks for that path under the sysroot, which means we need
    # the same symlinks under TOOLCHAIN_DIR.
    [ -e "${TOOLCHAIN_DIR}/lib" ] || ln -s usr/lib "${TOOLCHAIN_DIR}/lib"
    [ -e "${TOOLCHAIN_DIR}/lib64" ] || ln -s usr/lib64 "${TOOLCHAIN_DIR}/lib64"

    # Replace the unversioned g++/gcc/cpp symlinks with wrapper
    # scripts that pass --sysroot=<toolchain> and -B <gcc-exec-prefix>.
    # Without these flags gcc would fall back to its compiled-in
    # /usr search and fail to find headers (the runtime image has no
    # libc6-dev) or fail to invoke `as`/`ld` (binutils not on PATH at
    # /usr/bin). Wrappers self-resolve their location at runtime so
    # they work from any BackendsPath.
    BIN_DIR="${TOOLCHAIN_DIR}/usr/bin"
    if [ -n "${GCC_VER}" ] && [ -n "${HOST_TRIPLET}" ]; then
        # HOST_TRIPLET in package names uses dashes ("x86-64-linux-gnu");
        # the binary suffix uses underscores in the arch part
        # ("x86_64-linux-gnu-g++-13"). Translate.
        BIN_TRIPLET=${HOST_TRIPLET//x86-64/x86_64}
        for tool in g++ gcc cpp; do
            real="${BIN_DIR}/${BIN_TRIPLET}-${tool}-${GCC_VER}"
            if [ -x "${real}" ]; then
                rm -f "${BIN_DIR}/${tool}" "${BIN_DIR}/${tool}-${GCC_VER}"
                cat > "${BIN_DIR}/${tool}" <<EOF
#!/bin/bash
# Auto-generated by package.sh. Passes --sysroot and -B so the
# bundled toolchain works from any BackendsPath without depending
# on libc6-dev / binutils being installed at /usr in the runtime
# image. See backend/python/vllm/package.sh.
DIR="\$(dirname "\$(readlink -f "\$0")")" # …/toolchain/usr/bin
SYSROOT="\$(dirname "\$(dirname "\${DIR}")")" # …/toolchain
exec "\${DIR}/${BIN_TRIPLET}-${tool}-${GCC_VER}" \\
    -B "\${SYSROOT}/usr/lib/gcc/${BIN_TRIPLET}/${GCC_VER}/" \\
    --sysroot="\${SYSROOT}" \\
    "\$@"
EOF
                chmod +x "${BIN_DIR}/${tool}"
            fi
        done
    fi
    echo "Bundled g++ toolchain (gcc-${GCC_VER}) into ${TOOLCHAIN_DIR} ($(du -sh "${TOOLCHAIN_DIR}" | cut -f1))"
fi

echo "vllm packaging completed successfully"
ls -liah "${LIB_DIR}/"
backend/python/vllm/pyproject.toml (new file, 61 lines)

@@ -0,0 +1,61 @@
# L4T arm64 (JetPack 7 / sbsa cu130) install spec for the vllm backend.
#
# Why this file exists, and why only the l4t13 BUILD_PROFILE consumes it:
#
# pypi.jetson-ai-lab.io hosts the L4T-specific torch / vllm / flash-attn
# wheels we need on aarch64 + cuda13, but it ALSO transparently proxies the
# rest of PyPI through `/+f/<sha>/<filename>` URLs that 503 frequently. With
# `--extra-index-url` + `--index-strategy=unsafe-best-match` (the historical
# fix in install.sh) uv would pick those proxy URLs for ordinary PyPI
# packages — `anthropic`, `openai`, `propcache`, `annotated-types` — and
# trip on the 503s. See e.g. CI run 25212201349 (anthropic-0.97.0).
#
# `explicit = true` on the index makes uv consult the L4T mirror ONLY for
# packages mapped under [tool.uv.sources]. Everything else goes to PyPI.
# This breaks the historical 503 path without losing access to the L4T
# wheels we actually need from there.
#
# `uv pip install -r requirements.txt` does NOT honor [tool.uv.sources]
# (sources are project-mode only, not pip-compat mode), so install.sh's
# l4t13 branch invokes `uv pip install --requirement pyproject.toml`
# directly. Other BUILD_PROFILEs continue to use the requirements-*.txt
# pipeline through libbackend.sh's installRequirements and never read
# this file.
[project]
name = "localai-vllm-l4t13"
version = "0.0.0"
requires-python = ">=3.12,<3.13"
dependencies = [
    # Mirror of requirements.txt — kept in sync manually for now since the
    # l4t13 path bypasses installRequirements (see install.sh).
    "grpcio==1.80.0",
    "protobuf",
    "certifi",
    "setuptools",
    "pillow",
    "charset-normalizer>=3.4.0",
    "chardet",
    # L4T-specific accelerator stack (sourced from jetson-ai-lab below).
    "torch",
    "torchvision",
    "torchaudio",
    "flash-attn",
    "vllm",
    # PyPI-resolvable packages that complete the runtime — accelerate,
    # transformers, bitsandbytes carry their own wheels for aarch64.
    "accelerate",
    "transformers",
    "bitsandbytes",
]

[[tool.uv.index]]
name = "jetson-ai-lab"
url = "https://pypi.jetson-ai-lab.io/sbsa/cu130"
explicit = true

[tool.uv.sources]
torch = { index = "jetson-ai-lab" }
torchvision = { index = "jetson-ai-lab" }
torchaudio = { index = "jetson-ai-lab" }
flash-attn = { index = "jetson-ai-lab" }
vllm = { index = "jetson-ai-lab" }
backend/python/vllm/requirements-intel-after.txt

@@ -1 +1,3 @@
vllm
# Intel XPU has no upstream-published vllm wheels — install.sh builds vllm
# from source with VLLM_TARGET_DEVICE=xpu and hides this file during
# installRequirements. Don't add a `vllm` line here.
backend/python/vllm/requirements-intel.txt

@@ -1,7 +1,8 @@
--extra-index-url https://download.pytorch.org/whl/xpu
# vllm's own deps (torch==2.11.0+xpu, vllm_xpu_kernels, pydantic, …) are
# installed from upstream's requirements/xpu.txt during the source build —
# see install.sh. Only list what LocalAI's vllm backend.py needs directly.
accelerate
torch
transformers
optimum[openvino]
bitsandbytes
setuptools
bitsandbytes
@@ -1,2 +0,0 @@
--extra-index-url https://pypi.jetson-ai-lab.io/sbsa/cu130
vllm
@@ -1,8 +0,0 @@
--extra-index-url https://pypi.jetson-ai-lab.io/sbsa/cu130
accelerate
torch
torchvision
torchaudio
transformers
bitsandbytes
flash-attn
backend/python/vllm/requirements.txt

@@ -1,4 +1,7 @@
grpcio==1.80.0
protobuf
certifi
setuptools
setuptools
pillow
charset-normalizer>=3.4.0
chardet
backend/python/vllm/run.sh

@@ -1,4 +1,5 @@
#!/bin/bash
set -x

backend_dir=$(dirname $0)

@@ -8,4 +9,41 @@ else
    source $backend_dir/../common/libbackend.sh
fi

startBackend $@
# CPU profile: torch._inductor's ISA-probe (run at vllm engine
# startup, even with enforce_eager=True) shells out to g++. The
# LocalAI runtime image and the FROM-scratch backend image both
# omit a compiler; package.sh bundles one into ${EDIR}/toolchain
# along with wrapper scripts at toolchain/usr/bin that already pass
# --sysroot and -B. So all run.sh has to do is put the wrapper on
# PATH and expose the toolchain's shared libs (libisl, libmpc, libbfd,
# ...) to ld.so. No-op for other profiles -- the dir doesn't exist.
if [ -d "${EDIR}/toolchain/usr/bin" ]; then
    export PATH="${EDIR}/toolchain/usr/bin:${PATH}"
    _libpath="${EDIR}/toolchain/usr/lib/x86_64-linux-gnu"
    export LD_LIBRARY_PATH="${_libpath}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
fi

# Multi-node DP follower mode: when the first arg is `serve`, exec into
# vllm's own CLI instead of LocalAI's backend.py gRPC server. The
# follower speaks ZMQ directly to the head node's vllm ranks — there
# is no LocalAI gRPC on the follower side. Reaches this path via
# `local-ai p2p-worker vllm`.
if [ "${1:-}" = "serve" ]; then
    ensureVenv
    if [ "x${PORTABLE_PYTHON}" == "xtrue" ] || [ -x "$(_portable_python)" ]; then
        _makeVenvPortable --update-pyvenv-cfg
    fi
    if [ -d "${EDIR}/lib" ]; then
        export LD_LIBRARY_PATH="${EDIR}/lib:${LD_LIBRARY_PATH:-}"
    fi
    # Run the vllm console script through the venv python rather than
    # exec-ing it directly. uv bakes an absolute shebang at install time
    # (e.g. `#!/vllm/venv/bin/python3` from the build image) which
    # doesn't exist once the backend is relocated to BackendsPath, and
    # _makeVenvPortable's shebang rewriter only matches paths that
    # already point at ${EDIR}. Invoking python with the script as an
    # argument bypasses the shebang entirely.
    exec "${EDIR}/venv/bin/python" "${EDIR}/venv/bin/vllm" "$@"
fi

startBackend $@
core/cli/worker/labels.go (new file, 20 lines)
@@ -0,0 +1,20 @@
package worker

import "strings"

// ParseNodeLabels parses a comma-separated `k=v,k=v` string into a map.
// Whitespace around keys, values, and pairs is trimmed; pairs without
// `=` are skipped silently.
func ParseNodeLabels(input string) map[string]string {
	labels := make(map[string]string)
	if input == "" {
		return labels
	}
	for _, pair := range strings.Split(input, ",") {
		pair = strings.TrimSpace(pair)
		if k, v, ok := strings.Cut(pair, "="); ok {
			labels[strings.TrimSpace(k)] = strings.TrimSpace(v)
		}
	}
	return labels
}
@@ -8,8 +8,9 @@ type WorkerFlags struct {
}

type Worker struct {
	P2P            P2P            `cmd:"" name:"p2p-llama-cpp-rpc" help:"Starts a LocalAI llama.cpp worker in P2P mode (requires a token)"`
	P2PMLX         P2PMLX         `cmd:"" name:"p2p-mlx" help:"Starts a LocalAI MLX distributed worker in P2P mode (requires a token)"`
	LLamaCPP       LLamaCPP       `cmd:"" name:"llama-cpp-rpc" help:"Starts a llama.cpp worker in standalone mode"`
	MLXDistributed MLXDistributed `cmd:"" name:"mlx-distributed" help:"Starts an MLX distributed worker in standalone mode (requires --hostfile and --rank)"`
	P2P             P2P             `cmd:"" name:"p2p-llama-cpp-rpc" help:"Starts a LocalAI llama.cpp worker in P2P mode (requires a token)"`
	P2PMLX          P2PMLX          `cmd:"" name:"p2p-mlx" help:"Starts a LocalAI MLX distributed worker in P2P mode (requires a token)"`
	LLamaCPP        LLamaCPP        `cmd:"" name:"llama-cpp-rpc" help:"Starts a llama.cpp worker in standalone mode"`
	MLXDistributed  MLXDistributed  `cmd:"" name:"mlx-distributed" help:"Starts an MLX distributed worker in standalone mode (requires --hostfile and --rank)"`
	VLLMDistributed VLLMDistributed `cmd:"" name:"vllm" help:"Starts a vLLM data-parallel follower process. Multi-node DP for a single model: head runs the existing vllm backend with engine_args.data_parallel_size>1, followers run this command."`
}
core/cli/worker/worker_backend_common.go (new file, 58 lines)
@@ -0,0 +1,58 @@
package worker

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"path/filepath"

	"github.com/mudler/LocalAI/core/config"
	"github.com/mudler/LocalAI/core/gallery"
	"github.com/mudler/LocalAI/pkg/model"
	"github.com/mudler/LocalAI/pkg/system"
	"github.com/mudler/xlog"
)

// findBackendPath resolves the directory containing a backend's run.sh,
// installing the backend from the gallery if it isn't present.
// `name` is the gallery entry name (for vLLM the meta entry "vllm"
// resolves to a platform-specific package via capability lookup).
func findBackendPath(name, galleries string, systemState *system.SystemState) (string, error) {
	backends, err := gallery.ListSystemBackends(systemState)
	if err != nil {
		return "", err
	}
	if backend, ok := backends.Get(name); ok {
		return runFileDir(backend.RunFile)
	}

	ml := model.NewModelLoader(systemState)
	var gals []config.Gallery
	if err := json.Unmarshal([]byte(galleries), &gals); err != nil {
		xlog.Error("failed loading galleries", "error", err)
		return "", err
	}
	if err := gallery.InstallBackendFromGallery(context.Background(), gals, systemState, ml, name, nil, true); err != nil {
		xlog.Error("backend not found, failed to install it", "name", name, "error", err)
		return "", err
	}

	backends, err = gallery.ListSystemBackends(systemState)
	if err != nil {
		return "", err
	}
	backend, ok := backends.Get(name)
	if !ok {
		return "", fmt.Errorf("%s backend not found after install", name)
	}
	return runFileDir(backend.RunFile)
}

func runFileDir(runFile string) (string, error) {
	dir := filepath.Dir(runFile)
	if dir == "" {
		return "", errors.New("backend has no run.sh, install it first")
	}
	return dir, nil
}
@@ -1,57 +1,16 @@
package worker

import (
	"context"
	"encoding/json"
	"errors"
	"os/exec"
	"path/filepath"

	"github.com/mudler/LocalAI/core/config"
	"github.com/mudler/LocalAI/core/gallery"
	"github.com/mudler/LocalAI/pkg/model"
	"github.com/mudler/LocalAI/pkg/system"
	"github.com/mudler/xlog"
)

const mlxDistributedGalleryName = "mlx-distributed"

// findMLXDistributedBackendPath finds or installs the mlx-distributed backend
// and returns the directory containing run.sh.
func findMLXDistributedBackendPath(galleries string, systemState *system.SystemState) (string, error) {
	backends, err := gallery.ListSystemBackends(systemState)
	if err != nil {
		return "", err
	}

	backend, ok := backends.Get(mlxDistributedGalleryName)
	if !ok {
		ml := model.NewModelLoader(systemState)
		var gals []config.Gallery
		if err := json.Unmarshal([]byte(galleries), &gals); err != nil {
			xlog.Error("failed loading galleries", "error", err)
			return "", err
		}
		if err := gallery.InstallBackendFromGallery(context.Background(), gals, systemState, ml, mlxDistributedGalleryName, nil, true); err != nil {
			xlog.Error("mlx-distributed backend not found, failed to install it", "error", err)
			return "", err
		}
		// Re-fetch after install
		backends, err = gallery.ListSystemBackends(systemState)
		if err != nil {
			return "", err
		}
		backend, ok = backends.Get(mlxDistributedGalleryName)
		if !ok {
			return "", errors.New("mlx-distributed backend not found after install")
		}
	}

	backendPath := filepath.Dir(backend.RunFile)
	if backendPath == "" {
		return "", errors.New("mlx-distributed backend not found, install it first")
	}
	return backendPath, nil
	return findBackendPath(mlxDistributedGalleryName, galleries, systemState)
}

// buildMLXCommand builds the exec.Cmd to launch the mlx-distributed backend.
core/cli/worker/worker_suite_test.go (new file, 13 lines)
@@ -0,0 +1,13 @@
package worker

import (
	"testing"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

func TestWorker(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "Worker Suite")
}
core/cli/worker/worker_vllm.go (new file, 221 lines)
@@ -0,0 +1,221 @@
package worker

import (
	"cmp"
	"context"
	"fmt"
	"os"
	"os/exec"
	"os/signal"
	"path/filepath"
	"strconv"
	"syscall"
	"time"

	cliContext "github.com/mudler/LocalAI/core/cli/context"
	"github.com/mudler/LocalAI/core/cli/workerregistry"
	"github.com/mudler/LocalAI/core/services/nodes"
	"github.com/mudler/LocalAI/pkg/system"
	"github.com/mudler/LocalAI/pkg/xsysinfo"
	"github.com/mudler/xlog"
)

// vLLMFollowerRoleLabel marks a node as a vLLM data-parallel follower.
// Operators scope regular models away from these nodes via inverse
// selectors like {"!node.role":"vllm-follower"}.
const vLLMFollowerRoleLabel = "vllm-follower"

// VLLMDistributed runs a vLLM follower process for multi-node
// data-parallel inference. The head runs LocalAI's existing single-
// node vLLM gRPC backend with engine_args.data_parallel_size > 1;
// followers run vanilla `vllm serve --headless ...` and speak ZMQ
// directly to the head.
//
// The follower is operator-launched (no NATS / SmartRouter placement
// in this iteration). When --register-to is set, the worker self-
// registers as an agent-type node so it shows up in the admin UI; a
// `node.role=vllm-follower` label discourages model placement on it.
type VLLMDistributed struct {
	WorkerFlags `embed:""`

	// Registration (optional). Without these the worker just runs vLLM
	// and exits — no UI visibility. With them set, the follower
	// registers as an agent-type node, heartbeats while vLLM is
	// running, and deregisters on shutdown.
	RegisterTo        string `env:"LOCALAI_REGISTER_TO" help:"Frontend URL for self-registration. Empty = no registration." group:"registration"`
	RegistrationToken string `env:"LOCALAI_REGISTRATION_TOKEN" help:"Token for authenticating with the frontend" group:"registration"`
	NodeName          string `env:"LOCALAI_NODE_NAME" help:"Node name for registration (defaults to vllm-<hostname>)" group:"registration"`
	NodeLabels        string `env:"LOCALAI_NODE_LABELS" help:"Comma-separated key=value labels for this node (node.role=vllm-follower is always added)" group:"registration"`
	HeartbeatInterval string `env:"LOCALAI_HEARTBEAT_INTERVAL" default:"10s" help:"Interval between heartbeats" group:"registration"`

	// vLLM data-parallel placement. The head must advertise the same
	// data_parallel_size / data_parallel_rpc_port via its engine_args;
	// followers use --master-addr / --master-port to find it.
	Model                 string   `arg:"" help:"HuggingFace model ID or local path (must match the head)"`
	DataParallelSize      int      `name:"data-parallel-size" env:"VLLM_DATA_PARALLEL_SIZE" required:"" help:"Total DP ranks across all nodes"`
	DataParallelSizeLocal int      `name:"data-parallel-size-local" env:"VLLM_DATA_PARALLEL_SIZE_LOCAL" required:"" help:"DP ranks on this node"`
	StartRank             int      `name:"start-rank" env:"VLLM_DATA_PARALLEL_START_RANK" required:"" help:"Starting DP rank for this node (>0 for followers)"`
	MasterAddr            string   `name:"master-addr" env:"VLLM_DP_MASTER_ADDR" required:"" help:"Head node IP/hostname for DP RPC handshake"`
	MasterPort            int      `name:"master-port" env:"VLLM_DP_MASTER_PORT" required:"" help:"Head node DP RPC port"`
	Headless              bool     `env:"VLLM_HEADLESS" default:"true" negatable:"" help:"Headless follower mode (no API server)"`
	ExtraArgs             []string `name:"vllm-arg" env:"VLLM_EXTRA_ARGS" help:"Additional CLI args passed verbatim to vllm serve (e.g. --tensor-parallel-size 2). May be repeated."`
}

func (r *VLLMDistributed) Run(ctx *cliContext.Context) error {
	// Rank 0 is the head: it must serve the OpenAI API. --headless
	// disables that, so the combination is operator error and would
	// silently produce a cluster that can't accept requests.
	if r.Headless && r.StartRank == 0 {
		return fmt.Errorf("--start-rank 0 (head) cannot be --headless; the head serves the API")
	}

	systemState, err := system.GetSystemState(
		system.WithBackendPath(r.BackendsPath),
		system.WithBackendSystemPath(r.BackendsSystemPath),
	)
	if err != nil {
		return fmt.Errorf("getting system state: %w", err)
	}

	backendPath, err := findBackendPath("vllm", r.BackendGalleries, systemState)
	if err != nil {
		return fmt.Errorf("cannot find vllm backend: %w", err)
	}

	args := r.buildVLLMArgs()
	runSh := filepath.Join(backendPath, "run.sh")

	shutdownCtx, shutdownCancel := context.WithCancel(context.Background())
	defer shutdownCancel()

	// Self-register so the follower is visible in the admin UI. Done
	// before vLLM starts so an unreachable frontend fails fast rather
	// than after the GPU is already loaded.
	if r.RegisterTo != "" {
		regClient := &workerregistry.RegistrationClient{
			FrontendURL:       r.RegisterTo,
			RegistrationToken: r.RegistrationToken,
		}
		nodeID, _, regErr := regClient.RegisterWithRetry(context.Background(), r.registrationBody(), 10)
		if regErr != nil {
			return fmt.Errorf("registering with frontend: %w", regErr)
		}
		xlog.Info("Registered with frontend", "nodeID", nodeID, "frontend", r.RegisterTo, "role", "vllm-follower")

		heartbeatInterval, _ := time.ParseDuration(r.HeartbeatInterval)
		heartbeatInterval = cmp.Or(heartbeatInterval, 10*time.Second)
		go regClient.HeartbeatLoop(shutdownCtx, nodeID, heartbeatInterval, r.heartbeatBody)

		defer regClient.GracefulDeregister(nodeID)
	}

	xlog.Info("Starting vllm follower",
		"model", r.Model,
		"data-parallel-size", r.DataParallelSize,
		"data-parallel-size-local", r.DataParallelSizeLocal,
		"start-rank", r.StartRank,
		"master", fmt.Sprintf("%s:%d", r.MasterAddr, r.MasterPort),
	)

	cmd := exec.CommandContext(shutdownCtx, runSh, args...)
	// VLLM_DP_* env vars are belt-and-braces alongside the explicit
	// CLI flags — vLLM honours both (vllm/envs.py:142-148).
	cmd.Env = append(os.Environ(),
		fmt.Sprintf("VLLM_DP_MASTER_IP=%s", r.MasterAddr),
		fmt.Sprintf("VLLM_DP_MASTER_PORT=%d", r.MasterPort),
		fmt.Sprintf("VLLM_DP_SIZE=%d", r.DataParallelSize),
		fmt.Sprintf("VLLM_DP_RANK=%d", r.StartRank),
		"VLLM_DP_RANK_LOCAL=0",
	)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.Stdin = os.Stdin

	// Forward INT/TERM to vLLM so it gets a chance to clean up its ZMQ
	// sockets. exec.CommandContext kills with SIGKILL on cancellation,
	// which we want as a fallback only.
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)
	defer signal.Stop(sigCh)

	if err := cmd.Start(); err != nil {
		return fmt.Errorf("starting vllm: %w", err)
	}

	waitErr := make(chan error, 1)
	go func() { waitErr <- cmd.Wait() }()

	for {
		select {
		case sig := <-sigCh:
			xlog.Info("Forwarding signal to vllm", "signal", sig)
			if cmd.Process != nil {
				_ = cmd.Process.Signal(sig)
			}
		case err := <-waitErr:
			return err
		}
	}
}

// buildVLLMArgs assembles the vLLM CLI argv. Factored out for unit
// testing — Run is hard to test without a real backend install.
func (r *VLLMDistributed) buildVLLMArgs() []string {
	args := []string{"serve", r.Model}
	if r.Headless {
		args = append(args, "--headless")
	}
	args = append(args,
		"--data-parallel-size", strconv.Itoa(r.DataParallelSize),
		"--data-parallel-size-local", strconv.Itoa(r.DataParallelSizeLocal),
		"--data-parallel-start-rank", strconv.Itoa(r.StartRank),
		"--data-parallel-address", r.MasterAddr,
		"--data-parallel-rpc-port", strconv.Itoa(r.MasterPort),
	)
	args = append(args, r.ExtraArgs...)
	return args
}

// registrationBody mirrors agent_worker.go's shape: agent-type nodes
// don't need an address, which fits a follower that doesn't host any
// LocalAI gRPC backends. The node.role label lets operators scope
// regular model placement away from followers.
func (r *VLLMDistributed) registrationBody() map[string]any {
	nodeName := r.NodeName
	if nodeName == "" {
		hostname, err := os.Hostname()
		if err != nil {
			nodeName = fmt.Sprintf("vllm-follower-%d", os.Getpid())
		} else {
			nodeName = "vllm-" + hostname
		}
	}

	totalVRAM, _ := xsysinfo.TotalAvailableVRAM()
	gpuVendor, _ := xsysinfo.DetectGPUVendor()

	body := map[string]any{
		"name":           nodeName,
		"node_type":      nodes.NodeTypeAgent,
		"total_vram":     totalVRAM,
		"available_vram": totalVRAM,
		"gpu_vendor":     gpuVendor,
	}
	if r.RegistrationToken != "" {
		body["token"] = r.RegistrationToken
	}

	labels := ParseNodeLabels(r.NodeLabels)
	labels["node.role"] = vLLMFollowerRoleLabel
	body["labels"] = labels
	return body
}

func (r *VLLMDistributed) heartbeatBody() map[string]any {
	body := map[string]any{}
	aggregate := xsysinfo.GetGPUAggregateInfo()
	if aggregate.TotalVRAM > 0 {
		body["available_vram"] = aggregate.FreeVRAM
	}
	return body
}
core/cli/worker/worker_vllm_test.go (new file, 105 lines)
@@ -0,0 +1,105 @@
package worker

import (
	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

var _ = Describe("VLLMDistributed", func() {
	Describe("buildVLLMArgs", func() {
		DescribeTable("produces the expected vLLM CLI argv",
			func(cmd VLLMDistributed, want []string) {
				Expect(cmd.buildVLLMArgs()).To(Equal(want))
			},
			Entry("headless follower with explicit master",
				VLLMDistributed{
					Model:                 "Qwen/Qwen3.5-1.5B",
					DataParallelSize:      4,
					DataParallelSizeLocal: 2,
					StartRank:             2,
					MasterAddr:            "10.0.0.1",
					MasterPort:            32100,
					Headless:              true,
				},
				[]string{
					"serve", "Qwen/Qwen3.5-1.5B",
					"--headless",
					"--data-parallel-size", "4",
					"--data-parallel-size-local", "2",
					"--data-parallel-start-rank", "2",
					"--data-parallel-address", "10.0.0.1",
					"--data-parallel-rpc-port", "32100",
				},
			),
			Entry("head-style invocation: rank 0, not headless",
				VLLMDistributed{
					Model:                 "moonshotai/Kimi-K2.6-Instruct",
					DataParallelSize:      8,
					DataParallelSizeLocal: 4,
					StartRank:             0,
					MasterAddr:            "127.0.0.1",
					MasterPort:            32100,
					Headless:              false,
				},
				[]string{
					"serve", "moonshotai/Kimi-K2.6-Instruct",
					"--data-parallel-size", "8",
					"--data-parallel-size-local", "4",
					"--data-parallel-start-rank", "0",
					"--data-parallel-address", "127.0.0.1",
					"--data-parallel-rpc-port", "32100",
				},
			),
			Entry("extra args appended verbatim",
				VLLMDistributed{
					Model:                 "Qwen/Qwen3.5-1.5B",
					DataParallelSize:      2,
					DataParallelSizeLocal: 1,
					StartRank:             1,
					MasterAddr:            "head.local",
					MasterPort:            32100,
					Headless:              true,
					ExtraArgs:             []string{"--tensor-parallel-size", "2", "--enable-expert-parallel"},
				},
				[]string{
					"serve", "Qwen/Qwen3.5-1.5B",
					"--headless",
					"--data-parallel-size", "2",
					"--data-parallel-size-local", "1",
					"--data-parallel-start-rank", "1",
					"--data-parallel-address", "head.local",
					"--data-parallel-rpc-port", "32100",
					"--tensor-parallel-size", "2",
					"--enable-expert-parallel",
				},
			),
		)
	})

	Describe("registrationBody", func() {
		// Followers don't host LocalAI gRPC, so node_type must be "agent"
		// to bypass the address requirement on /api/node/register, and the
		// node.role label is the contract operators rely on to scope normal
		// model placement away from these nodes.
		It("registers as agent-type with the vllm-follower role label", func() {
			cmd := VLLMDistributed{
				NodeName:              "test-follower",
				DataParallelSize:      4,
				DataParallelSizeLocal: 2,
				StartRank:             2,
				MasterAddr:            "10.0.0.1",
				NodeLabels:            "tier=fast,gpu.vendor=nvidia",
			}
			body := cmd.registrationBody()

			Expect(body).To(HaveKeyWithValue("node_type", "agent"))
			Expect(body).To(HaveKeyWithValue("name", "test-follower"))

			labels, ok := body["labels"].(map[string]string)
			Expect(ok).To(BeTrue(), "labels must be map[string]string")
			Expect(labels).To(HaveKeyWithValue("node.role", "vllm-follower"))
			Expect(labels).To(HaveKeyWithValue("tier", "fast"))
			Expect(labels).To(HaveKeyWithValue("gpu.vendor", "nvidia"))
		})
	})
})
@@ -305,6 +305,97 @@ MCP servers configured in model configs work in distributed mode. The frontend r
|
||||
- **MCP tool execution** (during `/v1/chat/completions`): tool calls are routed to agent workers via NATS request-reply
|
||||
- **MCP CI jobs**: executed entirely on agent workers with access to docker for stdio-based MCP servers
|
||||
|
||||
## vLLM Multi-Node (Data-Parallel)
|
||||
|
||||
A single vLLM model can span multiple GPU nodes via data parallelism: the head node serves the OpenAI API and runs the local DP ranks, follower nodes run vanilla `vllm serve --headless` and speak ZMQ directly to the head. LocalAI's role is starting the follower processes and surfacing them in the admin UI; the cross-rank tensor traffic is vLLM's own.
|
||||
|
||||
This mode is **operator-launched** — the head config and each follower's invocation must agree on the topology (`data_parallel_size`, `data_parallel_size_local`, `data_parallel_address`, `data_parallel_rpc_port`). The SmartRouter does not place follower ranks automatically.
|
||||
|
||||
### Head node configuration
|
||||
|
||||
The head runs the existing single-node vLLM gRPC backend. Set `engine_args` to publish the DP topology vLLM expects:
|
||||
|
||||
```yaml
|
||||
backend: vllm
|
||||
parameters:
|
||||
model: moonshotai/Kimi-K2.6-Instruct
|
||||
engine_args:
|
||||
data_parallel_size: 4 # total ranks across all nodes
|
||||
data_parallel_size_local: 2 # ranks on the head node
|
||||
data_parallel_address: 10.0.0.1 # head's reachable IP
|
||||
data_parallel_rpc_port: 32100 # any free port; followers connect here
|
||||
enable_expert_parallel: true # for MoE models
|
||||
```
|
||||
|
||||
The head will start its 2 local ranks, listen on `10.0.0.1:32100`, and wait for the remaining 2 ranks to handshake.
|
||||
|
||||
### Follower nodes
|
||||
|
||||
Each follower runs `local-ai p2p-worker vllm` with matching topology, an explicit start rank, and the head's address:
|
||||
|
||||
```bash
|
||||
local-ai p2p-worker vllm \
|
||||
moonshotai/Kimi-K2.6-Instruct \
|
||||
--data-parallel-size 4 \
|
||||
--data-parallel-size-local 2 \
|
||||
--start-rank 2 \
|
||||
--master-addr 10.0.0.1 \
|
||||
--master-port 32100 \
|
||||
--register-to http://frontend:8080 \
|
||||
--registration-token changeme
|
||||
```
|
||||
|
||||
`--register-to` is optional but recommended — it makes the follower visible in the admin UI as an `agent`-type node tagged with `node.role=vllm-follower`. Without it the worker just runs vLLM and exits silently when vLLM does. The role label discourages SmartRouter from placing other models on the follower; pair it with model selectors like `{"!node.role":"vllm-follower"}` if you also run regular LocalAI models on the same fleet.
|
||||
|
||||
### Worked example: 2-node Kimi-K2.6 deployment
|
||||
|
||||
Two A100 nodes (`10.0.0.1`, `10.0.0.2`), 8 GPUs total, `data_parallel_size=8` with 4 ranks per node:
|
||||
|
||||
```yaml
|
||||
# /models/kimi.yaml on the head (10.0.0.1)
|
||||
name: kimi-k2-6
|
||||
backend: vllm
|
||||
parameters:
|
||||
model: moonshotai/Kimi-K2.6-Instruct
|
||||
engine_args:
|
||||
data_parallel_size: 8
|
||||
data_parallel_size_local: 4
|
||||
data_parallel_address: 10.0.0.1
|
||||
data_parallel_rpc_port: 32100
|
||||
enable_expert_parallel: true
|
||||
all2all_backend: deepep_high_throughput
|
||||
```
|
||||
|
||||
```bash
|
||||
# On 10.0.0.2 (follower)
|
||||
local-ai p2p-worker vllm moonshotai/Kimi-K2.6-Instruct \
|
||||
--data-parallel-size 8 --data-parallel-size-local 4 --start-rank 4 \
|
||||
--master-addr 10.0.0.1 --master-port 32100 \
|
||||
--register-to http://10.0.0.1:8080 --registration-token changeme
|
||||
```
|
||||
|
||||
A `curl http://10.0.0.1:8080/v1/chat/completions ...` against the head will then dispatch across all 8 ranks.
### Intel Arc / XPU notes

vLLM XPU supports DP (`vllm/platforms/xpu.py:198` handles `world_size_across_dp > 1`; ranks bind to `xpu:{local_rank}` in `xpu_worker.py:62`, with xccl as the collective backend). Each rank still needs a distinct discrete GPU — the iGPU on a hybrid host is not a viable second device.

Older XE-HPG GPUs (e.g. Arc A770) need to bypass the cutlass attention path:
```yaml
engine_args:
  attention_backend: TRITON_ATTN
```
`docker-compose.vllm-multinode.intel.yaml` at the repo root is the Intel equivalent of `docker-compose.vllm-multinode.yaml` — it uses `/dev/dri` passthrough, `ZE_AFFINITY_MASK` to pin each rank to one device, and the `latest-gpu-intel` images. Run it via `./tests/e2e/vllm-multinode/smoke.sh --intel`.
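
For orientation, a stripped-down follower service in that compose style; this is a sketch with illustrative values (mirroring the two-rank smoke topology), not an excerpt from the actual file:

```yaml
services:
  vllm-follower:
    image: localai/localai:latest-gpu-intel
    devices:
      - /dev/dri                 # Intel GPU passthrough
    environment:
      ZE_AFFINITY_MASK: "1"      # pin this rank to the second XPU device
    command: >
      p2p-worker vllm Qwen/Qwen2.5-0.5B-Instruct
      --data-parallel-size 2 --data-parallel-size-local 1
      --start-rank 1 --master-addr localai-head --master-port 32100
```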
### Caveats

- **Tensor parallel within a node only.** vLLM v1 does not support TP across nodes; combine `tensor_parallel_size` (within a node, via `engine_args`) with `data_parallel_size` (across nodes), as sketched after this list.
- **Followers don't host LocalAI gRPC.** The follower process is vanilla vLLM, so `/api/backend-logs/<modelId>` does not stream follower output. Use `journalctl` / `kubectl logs` / compose logs for the follower's stderr.
- **Network reachability.** The head's `data_parallel_rpc_port` plus a range of ZMQ ports (typically `data_parallel_rpc_port..+N`) must be reachable from every follower. Open them in your firewall / security group.
- **Topology must match exactly.** A mismatch in `--data-parallel-size` between head and any follower will hang the handshake. Check the head's vLLM logs for `waiting for N DP ranks` if startup stalls.
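
A sketch of that TP-within-a-node, DP-across-nodes combination, with illustrative numbers for two 8-GPU nodes (each DP rank is a 4-GPU TP group; not a tested configuration):

```yaml
engine_args:
  tensor_parallel_size: 4       # each rank spans 4 GPUs within one node
  data_parallel_size: 4         # 4 ranks x 4 GPUs = 16 GPUs over 2 nodes
  data_parallel_size_local: 2   # ranks 0-1 (8 GPUs) on the head
  data_parallel_address: 10.0.0.1
  data_parallel_rpc_port: 32100
```

Followers would then pass `--start-rank 2 --data-parallel-size-local 2`, presumably forwarding the TP size via the `--vllm-arg` passthrough shown in the smoke test below.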
## Scaling
**Adding worker capacity:** Start additional `worker` instances pointing to the same frontend. They self-register automatically:
@@ -705,6 +705,22 @@ specific target models; pick the one that matches your target. The
drafter loads in its native precision regardless of the target's
`quantization:` setting.

Another example — picking a non-default attention backend (e.g. on
hardware where the default cutlass kernels aren't supported):
```yaml
engine_args:
  attention_backend: TRITON_ATTN
```
#### Multi-node data parallelism

`engine_args.data_parallel_size > 1` combined with the
`local-ai p2p-worker vllm` follower lets a single model span multiple
GPU nodes. See [vLLM Multi-Node (Data-Parallel)]({{% relref
"/docs/features/distributed-mode#vllm-multi-node-data-parallel" %}})
for the head/follower configuration and a worked Kimi-K2.6 example.
### Transformers

[Transformers](https://huggingface.co/docs/transformers/index) is a State-of-the-art Machine Learning library for PyTorch, TensorFlow, and JAX.
tests/e2e/distributed/vllm_multinode_test.go (new file, 245 lines)
@@ -0,0 +1,245 @@
package distributed_test

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
	"runtime"
	"time"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"

	"github.com/testcontainers/testcontainers-go"
	"github.com/testcontainers/testcontainers-go/network"
	"github.com/testcontainers/testcontainers-go/wait"
)
// vLLM data-parallel deployment config served by the head. KV cache is
// trimmed because the CPU smoke runs two engines on one box and the
// prebuilt wheel auto-sizes KV to fill RAM otherwise.
const qwenDPYAML = `name: qwen-dp
backend: vllm
parameters:
  model: Qwen/Qwen2.5-0.5B-Instruct
context_size: 512
trust_remote_code: true
template:
  use_tokenizer_template: true
engine_args:
  data_parallel_size: 2
  data_parallel_size_local: 1
  data_parallel_address: localai-head
  data_parallel_rpc_port: 32100
  enforce_eager: true
  max_model_len: 512
`
// End-to-end smoke for `local-ai p2p-worker vllm`. Two containers from
// the locally-built `local-ai:tests` image — head + headless follower
// — share a docker network and a backend bind-mount (so the cpu-vllm
// backend extracted by `make extract-backend-vllm` is seen as a system
// backend, no gallery fetch). DP=2 on a 0.5B model on CPU; the test
// asserts /readyz comes up across both ranks and a chat completion
// returns non-empty content.
//
// Required preconditions (the `test-e2e-vllm-multinode` Make target
// sets these up):
// - `local-ai:tests` image built (docker-build-e2e)
// - `local-backends/vllm/` populated (extract-backend-vllm)
// - LOCALAI_VLLM_BACKEND_DIR env var pointing at the extracted dir
var _ = Describe("vLLM multi-node DP on CPU", Ordered, Label("Distributed", "VLLMMultinode"), func() {
	var baseURL string

	BeforeAll(func() {
		ctx := context.Background()
		image := vllmEnvOrDefault("LOCALAI_IMAGE", "local-ai")
		tag := vllmEnvOrDefault("LOCALAI_IMAGE_TAG", "tests")
		imageRef := fmt.Sprintf("%s:%s", image, tag)

		// LOCALAI_VLLM_BACKEND_DIR is set by the dedicated
		// `make test-e2e-vllm-multinode` target. The general
		// `make test-e2e` target picks this file up too via
		// `ginkgo -r ./tests/e2e`; in that context skip rather
		// than fail.
		backendDir := os.Getenv("LOCALAI_VLLM_BACKEND_DIR")
		if backendDir == "" {
			Skip("LOCALAI_VLLM_BACKEND_DIR not set — run `make test-e2e-vllm-multinode`")
		}
		Expect(filepath.Join(backendDir, "run.sh")).To(BeAnExistingFile(),
			"extracted backend missing run.sh — check the extract-backend-vllm output")

		// State dir for the head: holds qwen-dp.yaml and is also where
		// LocalAI redirects HF_HOME for backend subprocesses
		// (pkg/model/initializers.go:76), so Qwen weights accumulate
		// here. Stable gitignored path under local-backends/ so the
		// container's root-owned writes don't trip Ginkgo's TempDir
		// cleanup, and successive runs reuse the ~1 GB download.
		configDir := filepath.Join(thisFileDir(), "..", "..", "..", "local-backends", "vllm-multinode-state")
		Expect(os.MkdirAll(configDir, 0o755)).To(Succeed())
		Expect(os.WriteFile(filepath.Join(configDir, "qwen-dp.yaml"), []byte(qwenDPYAML), 0o644)).To(Succeed())

		net, err := network.New(ctx)
		Expect(err).ToNot(HaveOccurred())
		DeferCleanup(func() {
			_ = net.Remove(context.Background())
		})
		commonMounts := testcontainers.ContainerMounts{
			{
				Source: testcontainers.DockerBindMountSource{HostPath: backendDir},
				Target: "/var/lib/local-ai/backends/vllm",
			},
		}

		// Head: rank 0, serves the OpenAI API. We wait briefly for the
		// HTTP port to bind (so MappedPort returns), then poll /readyz
		// with a long budget for the model load + DP handshake.
		head, err := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
			ContainerRequest: testcontainers.ContainerRequest{
				Image:        imageRef,
				ExposedPorts: []string{"8080/tcp"},
				Cmd:          []string{"run", "/models/qwen-dp.yaml"},
				Env: map[string]string{
					"LOCALAI_ADDRESS": "0.0.0.0:8080",
					// Cap KV cache per rank so two CPU engines fit on
					// one host. The prebuilt wheel auto-sizes from
					// available RAM otherwise and OOM-kills with two
					// ranks sharing a 32 GB box.
					"VLLM_CPU_KVCACHE_SPACE": "1",
					// The backend dir is bind-mounted from the host;
					// without this, Python writes .pyc files into
					// __pycache__ as root and `rm -rf local-backends/`
					// fails on the next `make extract-backend-vllm`.
					"PYTHONDONTWRITEBYTECODE": "1",
				},
				Networks:       []string{net.Name},
				NetworkAliases: map[string][]string{net.Name: {"localai-head"}},
				Mounts: append(commonMounts,
					testcontainers.ContainerMount{
						// Not read-only: LocalAI writes back auto-
						// detected hooks (parser defaults, ...) into
						// the config and HF cache files into this
						// dir.
						Source: testcontainers.DockerBindMountSource{HostPath: configDir},
						Target: "/models",
					}),
				LogConsumerCfg: &testcontainers.LogConsumerConfig{
					Consumers: []testcontainers.LogConsumer{&vllmLogConsumer{prefix: "head"}},
				},
				WaitingFor: wait.ForListeningPort("8080/tcp").WithStartupTimeout(2 * time.Minute),
			},
			Started: true,
		})
		Expect(err).ToNot(HaveOccurred())
		DeferCleanup(func() {
			_ = head.Terminate(context.Background())
		})
		// Follower: rank 1, headless. Speaks ZMQ directly to the head
		// rank — no LocalAI gRPC; `p2p-worker vllm` exec's vllm serve.
		follower, err := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
			ContainerRequest: testcontainers.ContainerRequest{
				Image: imageRef,
				Cmd: []string{
					"p2p-worker", "vllm", "Qwen/Qwen2.5-0.5B-Instruct",
					"--data-parallel-size=2",
					"--data-parallel-size-local=1",
					"--start-rank=1",
					"--master-addr=localai-head",
					"--master-port=32100",
					// Mirror max_model_len from qwen-dp.yaml so both
					// ranks agree on the KV cache shape.
					"--vllm-arg=--max-model-len=512",
				},
				Env: map[string]string{
					"VLLM_CPU_KVCACHE_SPACE":  "1",
					"PYTHONDONTWRITEBYTECODE": "1",
				},
				Networks: []string{net.Name},
				Mounts:   commonMounts,
				LogConsumerCfg: &testcontainers.LogConsumerConfig{
					Consumers: []testcontainers.LogConsumer{&vllmLogConsumer{prefix: "follower"}},
				},
			},
			Started: true,
		})
		Expect(err).ToNot(HaveOccurred())
		DeferCleanup(func() {
			_ = follower.Terminate(context.Background())
		})
		port, err := head.MappedPort(ctx, "8080/tcp")
		Expect(err).ToNot(HaveOccurred())
		baseURL = fmt.Sprintf("http://localhost:%s", port.Port())

		Eventually(func() (int, error) {
			resp, err := http.Get(baseURL + "/readyz")
			if err != nil {
				return 0, err
			}
			defer func() { _ = resp.Body.Close() }()
			return resp.StatusCode, nil
		}, "20m", "10s").Should(Equal(http.StatusOK), "head /readyz never went green — both ranks need to load the model and complete the ZMQ handshake")
	})
It("serves a chat completion across both ranks", func() {
|
||||
body, err := json.Marshal(map[string]any{
|
||||
"model": "qwen-dp",
|
||||
"messages": []map[string]string{
|
||||
{"role": "user", "content": "Reply with the single word: pong."},
|
||||
},
|
||||
"max_tokens": 16,
|
||||
"temperature": 0,
|
||||
})
|
||||
Expect(err).ToNot(HaveOccurred())
|
||||
|
||||
resp, err := http.Post(baseURL+"/v1/chat/completions", "application/json", bytes.NewReader(body))
|
||||
Expect(err).ToNot(HaveOccurred())
|
||||
defer func() { _ = resp.Body.Close() }()
|
||||
|
||||
raw, err := io.ReadAll(resp.Body)
|
||||
Expect(err).ToNot(HaveOccurred())
|
||||
Expect(resp.StatusCode).To(Equal(http.StatusOK), "non-200 from chat/completions: %s", string(raw))
|
||||
|
||||
var parsed struct {
|
||||
Choices []struct {
|
||||
Message struct {
|
||||
Content string `json:"content"`
|
||||
} `json:"message"`
|
||||
} `json:"choices"`
|
||||
}
|
||||
Expect(json.Unmarshal(raw, &parsed)).To(Succeed())
|
||||
Expect(parsed.Choices).ToNot(BeEmpty())
|
||||
Expect(parsed.Choices[0].Message.Content).ToNot(BeEmpty())
|
||||
})
|
||||
})
|
||||
|
||||
type vllmLogConsumer struct {
	prefix string
}

func (l *vllmLogConsumer) Accept(log testcontainers.Log) {
	_, _ = GinkgoWriter.Write([]byte("[" + l.prefix + "] " + string(log.Content)))
}

func vllmEnvOrDefault(key, def string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return def
}

// thisFileDir returns the directory of this test file so the test can
// be run from any working directory (`go test ./...` from the repo
// root is the common case).
func thisFileDir() string {
	_, file, _, _ := runtime.Caller(0)
	return filepath.Dir(file)
}