Mirror of https://github.com/mudler/LocalAI.git, synced 2026-05-17 13:10:23 -04:00
* feat(vllm): build vllm from source for Intel XPU
Upstream publishes no XPU wheels for vllm. The Intel profile was
silently picking up a non-XPU wheel that imported but errored at
engine init, and several runtime deps (pillow, charset-normalizer,
chardet) were missing on Intel -- backend.py crashed at import time
before the gRPC server came up.
Switch the Intel profile to upstream's documented from-source
procedure (docs/getting_started/installation/gpu.xpu.inc.md in
vllm-project/vllm):
- Bump portable Python to 3.12 -- vllm-xpu-kernels ships only a
cp312 wheel.
- Source /opt/intel/oneapi/setvars.sh so vllm's CMake build sees
the dpcpp/sycl compiler from the oneapi-basekit base image.
- Hide requirements-intel-after.txt during installRequirements
(it used to 'pip install vllm'); install vllm's deps from a
fresh git clone of vllm via 'uv pip install -r
requirements/xpu.txt', swap stock triton for
triton-xpu==3.7.0, then 'VLLM_TARGET_DEVICE=xpu uv pip install
--no-deps .'.
- requirements-intel.txt trimmed to LocalAI's direct deps
(accelerate / transformers / bitsandbytes); torch-xpu, vllm,
vllm_xpu_kernels and the rest come from upstream's xpu.txt
during the source build.
- requirements.txt: add pillow + charset-normalizer + chardet --
used by backend.py and missing on the Intel install profile.
- run.sh: 'set -x' so backend startup is visible in container
logs (the gRPC startup error path was previously opaque).
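Taken together, the install-time changes amount to roughly the following sequence (a sketch of the Intel-profile install flow; the clone path and exact ordering are illustrative, while the env var, requirements file, and pins come from the steps above):

    source /opt/intel/oneapi/setvars.sh          # dpcpp/sycl from the oneapi-basekit base image
    git clone https://github.com/vllm-project/vllm.git /tmp/vllm-src
    cd /tmp/vllm-src
    uv pip install -r requirements/xpu.txt       # torch-xpu, vllm_xpu_kernels, ...
    uv pip uninstall triton
    uv pip install triton-xpu==3.7.0             # swap stock triton for the XPU fork
    VLLM_TARGET_DEVICE=xpu uv pip install --no-deps .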
Also adds a one-line docs example for engine_args.attention_backend
under the vLLM section, since older Xe-HPG GPUs (e.g. Arc A770)
need TRITON_ATTN to bypass the cutlass path in vllm_xpu_kernels.
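In a model YAML the documented override looks roughly like this (name and model are placeholders; the attention_backend line is the addition):

    name: my-model
    backend: vllm
    parameters:
      model: Qwen/Qwen2.5-0.5B-Instruct
    engine_args:
      attention_backend: TRITON_ATTN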
Tested end-to-end on an Intel Arc A770 with Qwen2.5-0.5B-Instruct
via LocalAI's /v1/chat/completions.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(vllm): add multi-node data-parallel follower worker
vLLM v1's multi-node story is one process per node sharing a DP
coordinator over ZMQ -- the head runs the API server with
data_parallel_size > 1 and followers run `vllm serve --headless ...`
with matching topology. Today LocalAI can already configure DP on the
head via the engine_args YAML map, but there's no way to bring up the
follower nodes -- so the head sits waiting for ranks that never
handshake.
Add `local-ai p2p-worker vllm`, mirroring MLXDistributed's structural
precedent (operator-launched, static config, no NATS placement). The
worker:
- Optionally self-registers with the frontend as an agent-type node
tagged `node.role=vllm-follower` so it's visible in the admin UI
and operators can scope ordinary models away via inverse
selectors.
- Resolves the platform-specific vllm backend via the gallery's
"vllm" meta-entry (cuda*, intel-vllm, rocm-vllm, ...).
- Runs vLLM as a child process so the heartbeat goroutine survives
until vLLM exits; forwards SIGINT/SIGTERM so vLLM can clean up its
ZMQ sockets before we tear down.
- Validates that --headless combined with --start-rank 0 is rejected
(rank 0 is the head and must serve the API).
Backend run.sh dispatches `serve` as the first arg to vllm's own CLI
instead of LocalAI's backend.py gRPC server -- the follower speaks
ZMQ directly to the head, there is no LocalAI gRPC on the follower
side. Single-node usage is unchanged.
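Concretely, a two-rank deployment mirrors the e2e test included below: the
head is an ordinary LocalAI instance whose model YAML carries the
data_parallel_* engine_args, and the follower is started roughly as
(hostname, port, and model are placeholders):

    local-ai p2p-worker vllm Qwen/Qwen2.5-0.5B-Instruct \
      --data-parallel-size=2 \
      --data-parallel-size-local=1 \
      --start-rank=1 \
      --master-addr=localai-head \
      --master-port=32100 \
      --vllm-arg=--max-model-len=512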
Generalises the gallery resolution helper into findBackendPath()
shared by MLX and vLLM workers; extracts ParseNodeLabels for the
comma-separated label parsing both use.
Ships with two compose recipes (`docker-compose.vllm-multinode.yaml`
for NVIDIA, `docker-compose.vllm-multinode.intel.yaml` for Intel
XPU/xccl) plus `tests/e2e/vllm-multinode/smoke.sh`. Both vendors are
supported (NCCL for CUDA/ROCm, xccl for XPU) but mixed-vendor DP is
not -- PyTorch's process group requires every rank to use the same
collective backend, and NCCL/xccl/gloo don't interoperate.
Out of scope (deferred): SmartRouter-driven placement of follower
ranks via NATS backend.install events, follower log streaming through
/api/backend-logs, tensor-parallel across nodes, disaggregated
prefill via KVTransferConfig.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* test(vllm): CPU-only end-to-end test for multi-node DP
Adds tests/e2e/vllm-multinode/, a Ginkgo + testcontainers-go suite
that brings up a head + headless follower from the locally-built
local-ai:tests image, bind-mounts the cpu-vllm backend extracted by
make extract-backend-vllm so it's seen as a system backend (no gallery
fetch, no registry server), and asserts a chat completion across both
DP ranks. New `make test-e2e-vllm-multinode` target wires the docker
build, backend extract, and ginkgo run together; BuildKit caches both
images so re-runs only rebuild what changed. Tagged Label("VLLMMultinode")
so the existing distributed suite isn't pulled along.
Two pre-existing bugs surfaced by the test:
1. extract-backend-% (Makefile) failed for every backend, because all
backend images end with `FROM scratch` and `docker create` rejects
an image with no CMD/ENTRYPOINT. Fixed by passing
--entrypoint=/run.sh -- the container is never started, only
docker-cp'd, so the path doesn't have to exist; we just need
anything that satisfies the daemon's create-time validation.
2. backend/python/vllm/run.sh's `serve` shortcut for the multi-node DP
follower exec'd ${EDIR}/venv/bin/vllm directly, but uv bakes an
absolute build-time shebang (`#!/vllm/venv/bin/python3`) that no
longer resolves once the backend is relocated to BackendsPath.
_makeVenvPortable's shebang rewriter only matches paths that
already point at ${EDIR}, so the original shebang slips through
unchanged. Fixed by exec-ing ${EDIR}/venv/bin/python with the script
as an argument -- Python ignores the script's shebang in that case.
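Both fixes are small; sketched below (paths and variable names follow the
descriptions above, not the literal diffs):

    # 1. Makefile extract-backend-%: give `docker create` an entrypoint so the
    #    FROM scratch image passes create-time validation; the container is
    #    never started, only copied from. $IMAGE is illustrative.
    docker create --name extract-tmp --entrypoint=/run.sh "$IMAGE"
    docker cp extract-tmp:/. local-backends/vllm/
    docker rm extract-tmp

    # 2. backend/python/vllm/run.sh `serve` shortcut: invoke the venv's python
    #    directly so the baked #!/vllm/venv/bin/python3 shebang is never consulted.
    exec "${EDIR}/venv/bin/python" "${EDIR}/venv/bin/vllm" "$@"   # "$@" begins with `serve`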
The test fixture caps memory aggressively (max_model_len=512,
VLLM_CPU_KVCACHE_SPACE=1, TORCH_COMPILE_DISABLE=1) so two CPU engines
fit on a 32 GB box. TORCH_COMPILE_DISABLE is currently mandatory for
cpu-vllm: torch._inductor's CPU-ISA probe runs even with
enforce_eager=True and needs g++ on PATH, which the LocalAI runtime
image doesn't ship -- to be addressed in a follow-up that bundles a
toolchain in the cpu-vllm backend.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(vllm): bundle a g++ toolchain in the cpu-vllm backend image
torch._inductor's CPU-ISA probe (`cpu_model_runner.py:65 "Warming up
model for the compilation"`) shells out to `g++` at vllm engine
startup, regardless of `enforce_eager=True` -- the eager flag only
disables CUDA graphs, not inductor's first-batch warmup. The LocalAI
CPU runtime image (Dockerfile, unconditional apt list) does not ship
build-essential, and the cpu-vllm backend image is `FROM scratch`,
so any non-trivial inference on cpu-vllm crashes with:
torch._inductor.exc.InductorError:
InvalidCxxCompiler: No working C++ compiler found in
torch._inductor.config.cpp.cxx: (None, 'g++')
Bundling the toolchain in the CPU runtime image would bloat every
non-vllm-CPU deployment and force a single GCC version on backends
that may want clang or a different version. So this lives in the
backend, gated to BUILD_TYPE=='' (the CPU profile).
`package.sh` snapshots g++ + binutils + cc1plus + libstdc++ + libc6
(runtime + dev) + the support libs cc1plus links (libisl/libmpc/libmpfr/
libjansson) into ${BACKEND}/toolchain/, mirroring the /usr/... layout. The
unversioned binaries on Debian/Ubuntu are symlink chains pointing into
multiarch packages (`g++` -> `g++-13` -> `x86_64-linux-gnu-g++-13`,
the latter in `g++-13-x86-64-linux-gnu`), so the package list resolves
both the version and the arch-triplet variant. Symlinks /lib ->
usr/lib and /lib64 -> usr/lib64 are recreated under the toolchain
root because Ubuntu's UsrMerge keeps them at /, and ld scripts
(`libc.so`, `libm.so`) hardcode `/lib/...` paths that --sysroot
re-roots into the toolchain.
The unversioned `g++`/`gcc`/`cpp` symlinks are replaced with wrapper
shell scripts that resolve their own location at runtime and pass
`--sysroot=<toolchain>` and `-B <toolchain>/usr/lib/gcc/<triplet>/<ver>/`
to the underlying versioned binary. That's how torch's bare `g++ foo.cpp
-o foo` invocation finds cc1plus (via -B), system headers (via --sysroot), and
the bundled libstdc++ (--sysroot also re-roots the linker's library search).
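A wrapper of that shape, illustratively (the triplet, GCC version, and the
wrapper's location under <toolchain>/usr/bin are assumptions based on the
layout described above):

    #!/bin/sh
    # Resolve the toolchain root from this wrapper's own path, then re-root
    # headers/libs (--sysroot) and internal tools like cc1plus (-B).
    root="$(cd "$(dirname "$0")/../.." && pwd)"
    exec "$root/usr/bin/x86_64-linux-gnu-g++-13" \
      --sysroot="$root" \
      -B "$root/usr/lib/gcc/x86_64-linux-gnu/13/" \
      "$@"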
`run.sh` adds the toolchain bin dir to PATH and the toolchain's
shared-lib dir to LD_LIBRARY_PATH -- everything else (header search,
linker search, executable search) is encapsulated in the wrappers.
No-op for non-CPU builds, the dir doesn't exist there.
The cpu-vllm image grows by ~217 MB. Tradeoff is acceptable -- cpu-vllm
is already a niche profile (few users compared to GPU vllm) and the
alternative is a backend that crashes at first inference unless the
operator manually sets TORCH_COMPILE_DISABLE=1, which silently disables
all torch.compile optimizations.
Drops `TORCH_COMPILE_DISABLE=1` from tests/e2e/vllm-multinode -- the
smoke now exercises the real compile path through the bundled toolchain.
Test runtime is +20s for the warmup compile, still <90s end to end.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* fix(vllm): scope jetson-ai-lab index to L4T-specific wheels via pyproject.toml
The L4T arm64 build resolves dependencies through pypi.jetson-ai-lab.io,
which hosts the L4T-specific torch / vllm / flash-attn wheels but also
transparently proxies the rest of PyPI through `/+f/<sha>/<filename>`
URLs. With `--extra-index-url` + `--index-strategy=unsafe-best-match`
uv would pick those proxy URLs for ordinary PyPI packages —
anthropic/openai/propcache/annotated-types — and fail when the proxy
503s. Master is hitting the same bug on its own l4t-vllm matrix entry.
Switch the l4t13 install path to a pyproject.toml that marks the
jetson-ai-lab index `explicit = true` and pins only torch, torchvision,
torchaudio, flash-attn, and vllm to it via [tool.uv.sources]. uv won't
consult the L4T mirror for anything else, so transitive deps fall back
to PyPI as the default index — no exposure to the proxy 503s.
`uv pip install -r requirements.txt` ignores [tool.uv.sources], so the
l4t13 branch in install.sh now invokes `uv pip install --requirement
pyproject.toml` directly, replacing the old requirements-l4t13*.txt
files. Other BUILD_PROFILEs continue using libbackend.sh's
installRequirements and never read pyproject.toml.
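Structurally the pyproject.toml looks roughly like this (the [project]
metadata and the index URL path are placeholders; the [tool.uv] tables are
what scopes the mirror):

    [project]
    name = "vllm-l4t"            # placeholder metadata
    version = "0.0.0"
    dependencies = ["torch", "torchvision", "torchaudio", "flash-attn", "vllm"]

    [[tool.uv.index]]
    name = "jetson-ai-lab"
    url = "https://pypi.jetson-ai-lab.io/..."   # exact index path elided
    explicit = true   # only consulted for packages pinned to it below

    [tool.uv.sources]
    torch = { index = "jetson-ai-lab" }
    torchvision = { index = "jetson-ai-lab" }
    torchaudio = { index = "jetson-ai-lab" }
    flash-attn = { index = "jetson-ai-lab" }
    vllm = { index = "jetson-ai-lab" }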
Local resolution test (x86_64, dry-run) confirms uv hits the L4T
index for torch and falls through to PyPI for everything else.
Assisted-by: claude-code:claude-opus-4-7-1m [Read] [Edit] [Bash] [Write]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
246 lines · 8.3 KiB · Go
package distributed_test

import (
    "bytes"
    "context"
    "encoding/json"
    "fmt"
    "io"
    "net/http"
    "os"
    "path/filepath"
    "runtime"
    "time"

    . "github.com/onsi/ginkgo/v2"
    . "github.com/onsi/gomega"

    "github.com/testcontainers/testcontainers-go"
    "github.com/testcontainers/testcontainers-go/network"
    "github.com/testcontainers/testcontainers-go/wait"
)

// vLLM data-parallel deployment config served by the head. KV cache is
// trimmed because the CPU smoke runs two engines on one box and the
// prebuilt wheel auto-sizes KV to fill RAM otherwise.
const qwenDPYAML = `name: qwen-dp
backend: vllm
parameters:
  model: Qwen/Qwen2.5-0.5B-Instruct
context_size: 512
trust_remote_code: true
template:
  use_tokenizer_template: true
engine_args:
  data_parallel_size: 2
  data_parallel_size_local: 1
  data_parallel_address: localai-head
  data_parallel_rpc_port: 32100
  enforce_eager: true
  max_model_len: 512
`

// End-to-end smoke for `local-ai p2p-worker vllm`. Two containers from
// the locally-built `local-ai:tests` image — head + headless follower
// — share a docker network and a backend bind-mount (so the cpu-vllm
// backend extracted by `make extract-backend-vllm` is seen as a system
// backend, no gallery fetch). DP=2 on a 0.5B model on CPU; the test
// asserts /readyz comes up across both ranks and a chat completion
// returns non-empty content.
//
// Required preconditions (the `test-e2e-vllm-multinode` Make target
// sets these up):
//   - `local-ai:tests` image built (docker-build-e2e)
//   - `local-backends/vllm/` populated (extract-backend-vllm)
//   - LOCALAI_VLLM_BACKEND_DIR env var pointing at the extracted dir
var _ = Describe("vLLM multi-node DP on CPU", Ordered, Label("Distributed", "VLLMMultinode"), func() {
    var baseURL string

    BeforeAll(func() {
        ctx := context.Background()

        image := vllmEnvOrDefault("LOCALAI_IMAGE", "local-ai")
        tag := vllmEnvOrDefault("LOCALAI_IMAGE_TAG", "tests")
        imageRef := fmt.Sprintf("%s:%s", image, tag)

        // LOCALAI_VLLM_BACKEND_DIR is set by the dedicated
        // `make test-e2e-vllm-multinode` target. The general
        // `make test-e2e` target picks this file up too via
        // `ginkgo -r ./tests/e2e`; in that context skip rather
        // than fail.
        backendDir := os.Getenv("LOCALAI_VLLM_BACKEND_DIR")
        if backendDir == "" {
            Skip("LOCALAI_VLLM_BACKEND_DIR not set — run `make test-e2e-vllm-multinode`")
        }
        Expect(filepath.Join(backendDir, "run.sh")).To(BeAnExistingFile(),
            "extracted backend missing run.sh — check the extract-backend-vllm output")

        // State dir for the head: holds qwen-dp.yaml and is also where
        // LocalAI redirects HF_HOME for backend subprocesses
        // (pkg/model/initializers.go:76), so Qwen weights accumulate
        // here. Stable gitignored path under local-backends/ so the
        // container's root-owned writes don't trip Ginkgo's TempDir
        // cleanup, and successive runs reuse the ~1 GB download.
        configDir := filepath.Join(thisFileDir(), "..", "..", "..", "local-backends", "vllm-multinode-state")
        Expect(os.MkdirAll(configDir, 0o755)).To(Succeed())
        Expect(os.WriteFile(filepath.Join(configDir, "qwen-dp.yaml"), []byte(qwenDPYAML), 0o644)).To(Succeed())

        net, err := network.New(ctx)
        Expect(err).ToNot(HaveOccurred())
        DeferCleanup(func() {
            _ = net.Remove(context.Background())
        })

        commonMounts := testcontainers.ContainerMounts{
            {
                Source: testcontainers.DockerBindMountSource{HostPath: backendDir},
                Target: "/var/lib/local-ai/backends/vllm",
            },
        }

        // Head: rank 0, serves the OpenAI API. We wait briefly for the
        // HTTP port to bind (so MappedPort returns), then poll /readyz
        // with a long budget for the model load + DP handshake.
        head, err := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
            ContainerRequest: testcontainers.ContainerRequest{
                Image:        imageRef,
                ExposedPorts: []string{"8080/tcp"},
                Cmd:          []string{"run", "/models/qwen-dp.yaml"},
                Env: map[string]string{
                    "LOCALAI_ADDRESS": "0.0.0.0:8080",
                    // Cap KV cache per rank so two CPU engines fit on
                    // one host. The prebuilt wheel auto-sizes from
                    // available RAM otherwise and OOM-kills with two
                    // ranks sharing a 32 GB box.
                    "VLLM_CPU_KVCACHE_SPACE": "1",
                    // The backend dir is bind-mounted from the host;
                    // without this, Python writes .pyc files into
                    // __pycache__ as root and `rm -rf local-backends/`
                    // fails on the next `make extract-backend-vllm`.
                    "PYTHONDONTWRITEBYTECODE": "1",
                },
                Networks:       []string{net.Name},
                NetworkAliases: map[string][]string{net.Name: {"localai-head"}},
                Mounts: append(commonMounts,
                    testcontainers.ContainerMount{
                        // Not read-only: LocalAI writes back auto-detected
                        // hooks (parser defaults, ...) into the config and
                        // HF cache files into this dir.
                        Source: testcontainers.DockerBindMountSource{HostPath: configDir},
                        Target: "/models",
                    }),
                LogConsumerCfg: &testcontainers.LogConsumerConfig{
                    Consumers: []testcontainers.LogConsumer{&vllmLogConsumer{prefix: "head"}},
                },
                WaitingFor: wait.ForListeningPort("8080/tcp").WithStartupTimeout(2 * time.Minute),
            },
            Started: true,
        })
        Expect(err).ToNot(HaveOccurred())
        DeferCleanup(func() {
            _ = head.Terminate(context.Background())
        })

        // Follower: rank 1, headless. Speaks ZMQ directly to the head
        // rank — no LocalAI gRPC; `p2p-worker vllm` exec's vllm serve.
        follower, err := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
            ContainerRequest: testcontainers.ContainerRequest{
                Image: imageRef,
                Cmd: []string{
                    "p2p-worker", "vllm", "Qwen/Qwen2.5-0.5B-Instruct",
                    "--data-parallel-size=2",
                    "--data-parallel-size-local=1",
                    "--start-rank=1",
                    "--master-addr=localai-head",
                    "--master-port=32100",
                    // Mirror max_model_len from qwen-dp.yaml so both
                    // ranks agree on the KV cache shape.
                    "--vllm-arg=--max-model-len=512",
                },
                Env: map[string]string{
                    "VLLM_CPU_KVCACHE_SPACE":  "1",
                    "PYTHONDONTWRITEBYTECODE": "1",
                },
                Networks: []string{net.Name},
                Mounts:   commonMounts,
                LogConsumerCfg: &testcontainers.LogConsumerConfig{
                    Consumers: []testcontainers.LogConsumer{&vllmLogConsumer{prefix: "follower"}},
                },
            },
            Started: true,
        })
        Expect(err).ToNot(HaveOccurred())
        DeferCleanup(func() {
            _ = follower.Terminate(context.Background())
        })

        port, err := head.MappedPort(ctx, "8080/tcp")
        Expect(err).ToNot(HaveOccurred())
        baseURL = fmt.Sprintf("http://localhost:%s", port.Port())

        Eventually(func() (int, error) {
            resp, err := http.Get(baseURL + "/readyz")
            if err != nil {
                return 0, err
            }
            defer func() { _ = resp.Body.Close() }()
            return resp.StatusCode, nil
        }, "20m", "10s").Should(Equal(http.StatusOK), "head /readyz never went green — both ranks need to load the model and complete the ZMQ handshake")
    })

    It("serves a chat completion across both ranks", func() {
        body, err := json.Marshal(map[string]any{
            "model": "qwen-dp",
            "messages": []map[string]string{
                {"role": "user", "content": "Reply with the single word: pong."},
            },
            "max_tokens":  16,
            "temperature": 0,
        })
        Expect(err).ToNot(HaveOccurred())

        resp, err := http.Post(baseURL+"/v1/chat/completions", "application/json", bytes.NewReader(body))
        Expect(err).ToNot(HaveOccurred())
        defer func() { _ = resp.Body.Close() }()

        raw, err := io.ReadAll(resp.Body)
        Expect(err).ToNot(HaveOccurred())
        Expect(resp.StatusCode).To(Equal(http.StatusOK), "non-200 from chat/completions: %s", string(raw))

        var parsed struct {
            Choices []struct {
                Message struct {
                    Content string `json:"content"`
                } `json:"message"`
            } `json:"choices"`
        }
        Expect(json.Unmarshal(raw, &parsed)).To(Succeed())
        Expect(parsed.Choices).ToNot(BeEmpty())
        Expect(parsed.Choices[0].Message.Content).ToNot(BeEmpty())
    })
})

type vllmLogConsumer struct {
    prefix string
}

func (l *vllmLogConsumer) Accept(log testcontainers.Log) {
    _, _ = GinkgoWriter.Write([]byte("[" + l.prefix + "] " + string(log.Content)))
}

func vllmEnvOrDefault(key, def string) string {
    if v := os.Getenv(key); v != "" {
        return v
    }
    return def
}

// thisFileDir returns the directory of this test file so the test can
// be run from any working directory (`go test ./...` from the repo
// root is the common case).
func thisFileDir() string {
    _, file, _, _ := runtime.Caller(0)
    return filepath.Dir(file)
}