mirror of
https://github.com/mudler/LocalAI.git
synced 2026-05-17 13:10:23 -04:00
* feat(vllm): build vllm from source for Intel XPU
Upstream publishes no XPU wheels for vllm. The Intel profile was
silently picking up a non-XPU wheel that imported but errored at
engine init, and several runtime deps (pillow, charset-normalizer,
chardet) were missing on Intel -- backend.py crashed at import time
before the gRPC server came up.
Switch the Intel profile to upstream's documented from-source
procedure (docs/getting_started/installation/gpu.xpu.inc.md in
vllm-project/vllm):
- Bump portable Python to 3.12 -- vllm-xpu-kernels ships only a
cp312 wheel.
- Source /opt/intel/oneapi/setvars.sh so vllm's CMake build sees
the dpcpp/sycl compiler from the oneapi-basekit base image.
- Hide requirements-intel-after.txt during installRequirements
(it used to 'pip install vllm'); install vllm's deps from a
fresh git clone of vllm via 'uv pip install -r
requirements/xpu.txt', swap stock triton for
triton-xpu==3.7.0, then 'VLLM_TARGET_DEVICE=xpu uv pip install
--no-deps .'.
- requirements-intel.txt trimmed to LocalAI's direct deps
(accelerate / transformers / bitsandbytes); torch-xpu, vllm,
vllm_xpu_kernels and the rest come from upstream's xpu.txt
during the source build.
- requirements.txt: add pillow + charset-normalizer + chardet --
used by backend.py and missing on the Intel install profile.
- run.sh: 'set -x' so backend startup is visible in container
logs (the gRPC startup error path was previously opaque).
Also adds a one-line docs example for engine_args.attention_backend
under the vLLM section, since older XE-HPG GPUs (e.g. Arc A770)
need TRITON_ATTN to bypass the cutlass path in vllm_xpu_kernels.
Tested end-to-end on an Intel Arc A770 with Qwen2.5-0.5B-Instruct
via LocalAI's /v1/chat/completions.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(vllm): add multi-node data-parallel follower worker
vLLM v1's multi-node story is one process per node sharing a DP
coordinator over ZMQ -- the head runs the API server with
data_parallel_size > 1 and followers run `vllm serve --headless ...`
with matching topology. Today LocalAI can already configure DP on the
head via the engine_args YAML map, but there's no way to bring up the
follower nodes -- so the head sits waiting for ranks that never
handshake.
Add `local-ai p2p-worker vllm`, mirroring MLXDistributed's structural
precedent (operator-launched, static config, no NATS placement). The
worker:
- Optionally self-registers with the frontend as an agent-type node
tagged `node.role=vllm-follower` so it's visible in the admin UI
and operators can scope ordinary models away via inverse
selectors.
- Resolves the platform-specific vllm backend via the gallery's
"vllm" meta-entry (cuda*, intel-vllm, rocm-vllm, ...).
- Runs vLLM as a child process so the heartbeat goroutine survives
until vLLM exits; forwards SIGINT/SIGTERM so vLLM can clean up its
ZMQ sockets before we tear down.
- Rejects --headless combined with --start-rank 0 (rank 0 is the
head and must serve the API).
Backend run.sh dispatches `serve` as the first arg to vllm's own CLI
instead of LocalAI's backend.py gRPC server -- the follower speaks
ZMQ directly to the head, there is no LocalAI gRPC on the follower
side. Single-node usage is unchanged.
Generalises the gallery resolution helper into findBackendPath()
shared by MLX and vLLM workers; extracts ParseNodeLabels for the
comma-separated label parsing both use.
Ships with two compose recipes (`docker-compose.vllm-multinode.yaml`
for NVIDIA, `docker-compose.vllm-multinode.intel.yaml` for Intel
XPU/xccl) plus `tests/e2e/vllm-multinode/smoke.sh`. Both vendors are
supported (NCCL for CUDA/ROCm, xccl for XPU) but mixed-vendor DP is
not -- PyTorch's process group requires every rank to use the same
collective backend, and NCCL/xccl/gloo don't interoperate.
Out of scope (deferred): SmartRouter-driven placement of follower
ranks via NATS backend.install events, follower log streaming through
/api/backend-logs, tensor-parallel across nodes, disaggregated
prefill via KVTransferConfig.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* test(vllm): CPU-only end-to-end test for multi-node DP
Adds tests/e2e/vllm-multinode/, a Ginkgo + testcontainers-go suite
that brings up a head + headless follower from the locally-built
local-ai:tests image, bind-mounts the cpu-vllm backend extracted by
make extract-backend-vllm so it's seen as a system backend (no gallery
fetch, no registry server), and asserts a chat completion across both
DP ranks. New `make test-e2e-vllm-multinode` target wires the docker
build, backend extract, and ginkgo run together; BuildKit caches both
images so re-runs only rebuild what changed. Tagged Label("VLLMMultinode")
so the existing distributed suite isn't pulled along.
Two pre-existing bugs surfaced by the test:
1. extract-backend-% (Makefile) failed for every backend, because all
backend images end with `FROM scratch` and `docker create` rejects
an image with no CMD/ENTRYPOINT. Fixed by passing
--entrypoint=/run.sh -- the container is never started, only
docker-cp'd, so the path doesn't have to exist; we just need
anything that satisfies the daemon's create-time validation.
2. backend/python/vllm/run.sh's `serve` shortcut for the multi-node DP
follower exec'd ${EDIR}/venv/bin/vllm directly, but uv bakes an
absolute build-time shebang (`#!/vllm/venv/bin/python3`) that no
longer resolves once the backend is relocated to BackendsPath.
_makeVenvPortable's shebang rewriter only matches paths that
already point at ${EDIR}, so the original shebang slips through
unchanged. Fixed by exec-ing ${EDIR}/venv/bin/python with the script
as an argument -- Python ignores the script's shebang in that case.
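The shebang behaviour behind fix 2 can be reproduced without uv or vLLM. The following is a minimal sketch, not the actual run.sh change: `/nonexistent/venv/bin/sh` is a hypothetical stand-in for the build-time interpreter path, and `sh` stands in for the venv's Python.

```shell
#!/bin/sh
# Sketch: a stale absolute shebang breaks direct execution, but not
# "interpreter script". The shebang path below is a made-up stand-in for
# the build-time path that uv bakes into console scripts.
workdir=$(mktemp -d)
cat > "${workdir}/tool" <<'EOF'
#!/nonexistent/venv/bin/sh
echo "tool ran with: $*"
EOF
chmod +x "${workdir}/tool"

# Direct exec: the kernel tries the shebang interpreter, which is gone.
if "${workdir}/tool" serve 2>/dev/null; then
    direct_status=ok
else
    direct_status=failed
fi

# Explicit interpreter: the shebang line is just a comment to it.
via_interp=$(sh "${workdir}/tool" serve)

echo "direct: ${direct_status}"
echo "via interpreter: ${via_interp}"
rm -rf "${workdir}"
```

Passing the script as an argument to a known-good interpreter sidesteps the shebang entirely, so the launcher keeps working wherever the venv is relocated.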
The test fixture caps memory aggressively (max_model_len=512,
VLLM_CPU_KVCACHE_SPACE=1, TORCH_COMPILE_DISABLE=1) so two CPU engines
fit on a 32 GB box. TORCH_COMPILE_DISABLE is currently mandatory for
cpu-vllm: torch._inductor's CPU-ISA probe runs even with
enforce_eager=True and needs g++ on PATH, which the LocalAI runtime
image doesn't ship -- to be addressed in a follow-up that bundles a
toolchain in the cpu-vllm backend.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(vllm): bundle a g++ toolchain in the cpu-vllm backend image
torch._inductor's CPU-ISA probe (`cpu_model_runner.py:65 "Warming up
model for the compilation"`) shells out to `g++` at vllm engine
startup, regardless of `enforce_eager=True` -- the eager flag only
disables CUDA graphs, not inductor's first-batch warmup. The LocalAI
CPU runtime image (Dockerfile, unconditional apt list) does not ship
build-essential, and the cpu-vllm backend image is `FROM scratch`,
so any non-trivial inference on cpu-vllm crashes with:
torch._inductor.exc.InductorError:
InvalidCxxCompiler: No working C++ compiler found in
torch._inductor.config.cpp.cxx: (None, 'g++')
Bundling the toolchain in the CPU runtime image would bloat every
non-vllm-CPU deployment and force a single GCC version on backends
that may want clang or a different version. So this lives in the
backend, gated to BUILD_TYPE=='' (the CPU profile).
`package.sh` snapshots g++ + binutils + cc1plus + libstdc++ + libc6
(runtime + dev) + the math libs cc1plus links (libisl/libmpc/libmpfr/
libjansson) into ${BACKEND}/toolchain/, mirroring /usr/... layout. The
unversioned binaries on Debian/Ubuntu are symlink chains pointing into
multiarch packages (`g++` -> `g++-13` -> `x86_64-linux-gnu-g++-13`,
the latter in `g++-13-x86-64-linux-gnu`), so the package list resolves
both the version and the arch-triplet variant. Symlinks /lib ->
usr/lib and /lib64 -> usr/lib64 are recreated under the toolchain
root because Ubuntu's UsrMerge keeps them at /, and ld scripts
(`libc.so`, `libm.so`) hardcode `/lib/...` paths that --sysroot
re-roots into the toolchain.
The unversioned `g++`/`gcc`/`cpp` symlinks are replaced with wrapper
shell scripts that resolve their own location at runtime and pass
`--sysroot=<toolchain>` and `-B <toolchain>/usr/lib/gcc/<triplet>/<ver>/`
to the underlying versioned binary. That's how torch's bare `g++ foo.cpp
-o foo` invocation finds cc1plus (-B), system headers (--sysroot), and
the bundled libstdc++ (also --sysroot, which gcc forwards to the linker).
`run.sh` adds the toolchain bin dir to PATH and the toolchain's
shared-lib dir to LD_LIBRARY_PATH -- everything else (header search,
linker search, executable search) is encapsulated in the wrappers.
No-op for non-CPU builds, the dir doesn't exist there.
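The self-locating wrapper idea can be sketched in isolation. In this illustration `fake-g++-13` is a hypothetical stand-in for the versioned compiler binary (it only prints its arguments), and the `-B` flag is omitted for brevity; the point is that the wrapper derives the sysroot from its own path, so the tree can be moved after packaging.

```shell
#!/bin/sh
# Sketch of a relocatable wrapper. A fake "compiler" stands in for the
# versioned g++ binary and simply prints each argument it receives.
root=$(mktemp -d)
mkdir -p "${root}/toolchain/usr/bin"

cat > "${root}/toolchain/usr/bin/fake-g++-13" <<'EOF'
#!/bin/sh
printf '%s\n' "$@"
EOF

cat > "${root}/toolchain/usr/bin/g++" <<'EOF'
#!/bin/sh
DIR="$(dirname "$(readlink -f "$0")")"        # .../usr/bin
SYSROOT="$(dirname "$(dirname "${DIR}")")"    # toolchain root
exec "${DIR}/fake-g++-13" --sysroot="${SYSROOT}" "$@"
EOF
chmod +x "${root}/toolchain/usr/bin/fake-g++-13" "${root}/toolchain/usr/bin/g++"

# Relocate the whole tree, as happens when a backend moves to BackendsPath.
mv "${root}/toolchain" "${root}/moved"
PATH="${root}/moved/usr/bin:${PATH}"

# A bare invocation, like the one torch issues, now picks up the wrapper.
wrapper_args=$(g++ foo.cpp -o foo)
expected_sysroot=$(readlink -f "${root}/moved")
echo "${wrapper_args}"
rm -rf "${root}"
```

The first argument the fake compiler reports is `--sysroot=` followed by the relocated directory, which shows that nothing in the wrapper depends on the path used at packaging time.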
The cpu-vllm image grows by ~217 MB. Tradeoff is acceptable -- cpu-vllm
is already a niche profile (few users compared to GPU vllm) and the
alternative is a backend that crashes at first inference unless the
operator manually sets TORCH_COMPILE_DISABLE=1, which silently disables
all torch.compile optimizations.
Drops `TORCH_COMPILE_DISABLE=1` from tests/e2e/vllm-multinode -- the
smoke now exercises the real compile path through the bundled toolchain.
Test runtime is +20s for the warmup compile, still <90s end to end.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* fix(vllm): scope jetson-ai-lab index to L4T-specific wheels via pyproject.toml
The L4T arm64 build resolves dependencies through pypi.jetson-ai-lab.io,
which hosts the L4T-specific torch / vllm / flash-attn wheels but also
transparently proxies the rest of PyPI through `/+f/<sha>/<filename>`
URLs. With `--extra-index-url` + `--index-strategy=unsafe-best-match`
uv would pick those proxy URLs for ordinary PyPI packages —
anthropic/openai/propcache/annotated-types — and fail when the proxy
503s. Master is hitting the same bug on its own l4t-vllm matrix entry.
Switch the l4t13 install path to a pyproject.toml that marks the
jetson-ai-lab index `explicit = true` and pins only torch, torchvision,
torchaudio, flash-attn, and vllm to it via [tool.uv.sources]. uv won't
consult the L4T mirror for anything else, so transitive deps fall back
to PyPI as the default index — no exposure to the proxy 503s.
`uv pip install -r requirements.txt` ignores [tool.uv.sources], so the
l4t13 branch in install.sh now invokes `uv pip install --requirement
pyproject.toml` directly, replacing the old requirements-l4t13*.txt
files. Other BUILD_PROFILEs continue using libbackend.sh's
installRequirements and never read pyproject.toml.
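For illustration, a pyproject.toml of the shape described above might look roughly like the one this sketch writes out. The project name, version, dependency list, and index URL path are placeholders rather than the values used in the repository; only the `explicit = true` index plus `[tool.uv.sources]` pinning is the point.

```shell
#!/bin/sh
# Sketch only: project name, version and index URL path are placeholders.
workdir=$(mktemp -d)
cat > "${workdir}/pyproject.toml" <<'EOF'
[project]
name = "vllm-l4t-deps"        # hypothetical name
version = "0.0.0"
dependencies = ["torch", "vllm"]

# explicit = true: uv consults this index only for packages pinned to it
# in [tool.uv.sources]; everything else resolves from the default index.
[[tool.uv.index]]
name = "jetson-ai-lab"
url = "https://pypi.jetson-ai-lab.io/PLACEHOLDER"   # real path not shown here
explicit = true

[tool.uv.sources]
torch = { index = "jetson-ai-lab" }
vllm = { index = "jetson-ai-lab" }
EOF

pinned=$(grep -c 'index = "jetson-ai-lab"' "${workdir}/pyproject.toml")
echo "packages pinned to the explicit index: ${pinned}"

# The install step described above would then be (not run here):
#   uv pip install --requirement pyproject.toml
rm -rf "${workdir}"
```

Any dependency without an entry under `[tool.uv.sources]` never touches the explicit index, which is what keeps ordinary PyPI packages away from the proxy URLs.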
Local resolution test (x86_64, dry-run) confirms uv hits the L4T
index for torch and falls through to PyPI for everything else.
Assisted-by: claude-code:claude-opus-4-7-1m [Read] [Edit] [Bash] [Write]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
154 lines
6.8 KiB
Bash
Executable File
#!/bin/bash
# Script to package runtime shared libraries for the vllm backend.
#
# The final Dockerfile.python stage is FROM scratch, so system libraries
# must be explicitly copied into ${BACKEND}/lib so the backend can run on
# any host without installing them. libbackend.sh automatically adds that
# directory to LD_LIBRARY_PATH at run time.
#
# vllm's CPU C++ extension (vllm._C) dlopens libnuma.so.1 at import time;
# if it's missing, the _C_utils torch ops are never registered and the
# engine crashes with AttributeError on init_cpu_threads_env. libgomp is
# used by torch's CPU kernels; on some stripped-down hosts it's also
# absent, so we bundle it too.

set -e

CURDIR=$(dirname "$(realpath "$0")")
LIB_DIR="${CURDIR}/lib"
mkdir -p "${LIB_DIR}"

copy_with_symlinks() {
    local soname="$1"
    local hit=""
    for dir in /usr/lib/x86_64-linux-gnu /usr/lib/aarch64-linux-gnu /lib/x86_64-linux-gnu /lib/aarch64-linux-gnu /usr/lib /lib; do
        if [ -e "${dir}/${soname}" ]; then
            hit="${dir}/${soname}"
            break
        fi
    done
    if [ -z "${hit}" ]; then
        echo "warning: ${soname} not found in standard lib paths" >&2
        return 0
    fi
    # Follow the symlink to the real file, copy it, then recreate the symlink.
    local real
    real=$(readlink -f "${hit}")
    cp -v "${real}" "${LIB_DIR}/"
    local real_base
    real_base=$(basename "${real}")
    if [ "${real_base}" != "${soname}" ]; then
        ln -sf "${real_base}" "${LIB_DIR}/${soname}"
    fi
}

copy_with_symlinks libnuma.so.1
copy_with_symlinks libgomp.so.1

# CPU profile only: bundle a g++ toolchain so torch._inductor's
# ISA probe (always run at vllm engine startup, regardless of
# enforce_eager) finds a C++ compiler. The LocalAI runtime image
# is FROM ubuntu:24.04 with a minimal apt list that does not
# include build-essential, and the backend image itself is FROM
# scratch -- so without this, cpu-vllm crashes with
# torch._inductor.exc.InvalidCxxCompiler at first inference
# unless the operator manually sets TORCH_COMPILE_DISABLE=1.
#
# We snapshot every file owned by the toolchain packages, mirroring
# the /usr/... layout into ${BACKEND}/toolchain/. The unversioned
# g++/gcc/cpp entry points are then replaced with wrapper scripts that
# pass --sysroot and -B so g++ can find cc1plus, headers, and libs from
# any install location; run.sh only adds the toolchain to PATH and
# LD_LIBRARY_PATH. Adds ~217 MB to the cpu-vllm image, which is
# tolerable -- cpu-vllm is already a niche profile.
if [ "${BUILD_TYPE:-}" = "" ] && command -v dpkg-query >/dev/null 2>&1; then
    TOOLCHAIN_DIR="${CURDIR}/toolchain"
    mkdir -p "${TOOLCHAIN_DIR}"
    # The unversioned g++/gcc packages on Debian/Ubuntu only ship
    # symlinks; the actual binaries live in g++-${VER}/gcc-${VER}.
    # Discover the active version so the symlink targets get bundled
    # along with their owners.
    GCC_VER=$(gcc -dumpversion 2>/dev/null | cut -d. -f1 || true)
    # `g++-${VER}` itself is just another symlink layer on Debian/
    # Ubuntu -- the real binary `x86_64-linux-gnu-g++-${VER}` lives
    # in `g++-${VER}-x86-64-linux-gnu` (a separate package pulled in
    # as a dependency). Same story for gcc/cpp. Compute the dpkg
    # arch-triplet to find the right package name for both amd64 and
    # arm64 hosts.
    case "$(dpkg --print-architecture 2>/dev/null)" in
        amd64) HOST_TRIPLET="x86-64-linux-gnu" ;;
        arm64) HOST_TRIPLET="aarch64-linux-gnu" ;;
        *) HOST_TRIPLET="" ;;
    esac
    PKGS=(g++ gcc cpp libstdc++-${GCC_VER}-dev libgcc-${GCC_VER}-dev libc6 libc6-dev binutils binutils-common libbinutils libc-dev-bin linux-libc-dev libcrypt-dev libgomp1 libstdc++6 libgcc-s1 libisl23 libmpc3 libmpfr6 libjansson4 libctf0 libctf-nobfd0 libsframe1)
    if [ -n "${GCC_VER}" ]; then
        PKGS+=("g++-${GCC_VER}" "gcc-${GCC_VER}" "cpp-${GCC_VER}" "gcc-${GCC_VER}-base")
        if [ -n "${HOST_TRIPLET}" ]; then
            PKGS+=(
                "g++-${GCC_VER}-${HOST_TRIPLET}"
                "gcc-${GCC_VER}-${HOST_TRIPLET}"
                "cpp-${GCC_VER}-${HOST_TRIPLET}"
                "binutils-${HOST_TRIPLET}"
            )
        fi
    fi
    for pkg in "${PKGS[@]}"; do
        if ! dpkg-query -W "${pkg}" >/dev/null 2>&1; then
            continue
        fi
        # Copy each owned path, preserving symlinks and mode. We
        # tolerate dpkg listing directories alongside files.
        dpkg -L "${pkg}" | while IFS= read -r path; do
            if [ -L "${path}" ] || [ -f "${path}" ]; then
                mkdir -p "${TOOLCHAIN_DIR}$(dirname "${path}")"
                cp -aP "${path}" "${TOOLCHAIN_DIR}${path}" 2>/dev/null || true
            fi
        done
    done
    # Ubuntu's filesystem layout has /lib -> /usr/lib (UsrMerge) and
    # /lib64 -> /usr/lib64. ld scripts (e.g. libm.so) hardcode
    # `/lib/x86_64-linux-gnu/libm.so.6`; with --sysroot the linker
    # looks for that path under the sysroot, which means we need
    # the same symlinks under TOOLCHAIN_DIR.
    [ -e "${TOOLCHAIN_DIR}/lib" ] || ln -s usr/lib "${TOOLCHAIN_DIR}/lib"
    [ -e "${TOOLCHAIN_DIR}/lib64" ] || ln -s usr/lib64 "${TOOLCHAIN_DIR}/lib64"

    # Replace the unversioned g++/gcc/cpp symlinks with wrapper
    # scripts that pass --sysroot=<toolchain> and -B <gcc-exec-prefix>.
    # Without these flags gcc would fall back to its compiled-in
    # /usr search and fail to find headers (the runtime image has no
    # libc6-dev) or fail to invoke `as`/`ld` (binutils not on PATH at
    # /usr/bin). Wrappers self-resolve their location at runtime so
    # they work from any BackendsPath.
    BIN_DIR="${TOOLCHAIN_DIR}/usr/bin"
    if [ -n "${GCC_VER}" ] && [ -n "${HOST_TRIPLET}" ]; then
        # HOST_TRIPLET in package names uses dashes ("x86-64-linux-gnu");
        # the binary suffix uses underscores in the arch part
        # ("x86_64-linux-gnu-g++-13"). Translate.
        BIN_TRIPLET=${HOST_TRIPLET//x86-64/x86_64}
        for tool in g++ gcc cpp; do
            real="${BIN_DIR}/${BIN_TRIPLET}-${tool}-${GCC_VER}"
            if [ -x "${real}" ]; then
                rm -f "${BIN_DIR}/${tool}" "${BIN_DIR}/${tool}-${GCC_VER}"
                cat > "${BIN_DIR}/${tool}" <<EOF
#!/bin/bash
# Auto-generated by package.sh. Passes --sysroot and -B so the
# bundled toolchain works from any BackendsPath without depending
# on libc6-dev / binutils being installed at /usr in the runtime
# image. See backend/python/vllm/package.sh.
DIR="\$(dirname "\$(readlink -f "\$0")")" # …/toolchain/usr/bin
SYSROOT="\$(dirname "\$(dirname "\${DIR}")")" # …/toolchain
exec "\${DIR}/${BIN_TRIPLET}-${tool}-${GCC_VER}" \\
    -B "\${SYSROOT}/usr/lib/gcc/${BIN_TRIPLET}/${GCC_VER}/" \\
    --sysroot="\${SYSROOT}" \\
    "\$@"
EOF
                chmod +x "${BIN_DIR}/${tool}"
            fi
        done
    fi
    echo "Bundled g++ toolchain (gcc-${GCC_VER}) into ${TOOLCHAIN_DIR} ($(du -sh "${TOOLCHAIN_DIR}" | cut -f1))"
fi

echo "vllm packaging completed successfully"
ls -liah "${LIB_DIR}/"