mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-26 09:26:55 -04:00
feat(vllm): macOS/Metal support via vllm-metal (MLX) (#10489)
* feat(vllm): macOS/Metal support via vllm-metal (MLX)

Add an additive Apple-Silicon path to the existing vllm Python backend so vLLM runs on macOS via vllm-metal (github.com/vllm-project/vllm-metal).

Spike outcome (proven on a real M4 / macOS 26.5, Qwen3-0.6B):
- vllm-metal registers through vLLM's platform-plugin entry point (metal -> vllm_metal:register); MetalPlatform activates and runs on the GPU through MLX.
- LocalAI's backend.py is UNCHANGED: AsyncEngineArgs(...) -> AsyncLLMEngine.from_engine_args transparently resolves to vLLM 0.23's v1 AsyncLLM MLX engine, and async generate produced correct output.
- backend.py is NOT touched: its only empty_cache() call is CUDA-only (guarded by torch.cuda.is_available()), so the benign shutdown-only "Allocator for mps is not a DeviceAllocator" noise comes from vLLM's internal EngineCore teardown, not from our code.

Changes (all gated behind a darwin condition; Linux/CUDA/ROCm/Intel paths are byte-for-byte unchanged):
- install.sh: darwin branch forces PYTHON_VERSION=3.12 (vllm-metal requirement), creates/activates LocalAI's managed venv via ensureVenv, then reproduces vllm-metal's installer INTO that venv (build vLLM 0.23.0 from the release source tarball against requirements/cpu.txt, then install the prebuilt vllm-metal wheel from its latest GitHub release), and runs runProtogen. installRequirements is skipped on darwin.
- backend-matrix.yml: add a vllm includeDarwin entry (mps, python).
- index.yaml: add metal capability + concrete metal-vllm / metal-vllm-development child entries mirroring the metal-kitten-tts template.

Version coupling: vllm-metal pins vLLM 0.23.0, equal to LocalAI's current vllm pin. Bumping vllm must be coordinated with a supporting vllm-metal release; documented in install.sh and requirements-cublas13-after.txt.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* chore(vllm): track the darwin vllm-metal pin via the autobumper

The Apple Silicon build pinned vLLM 0.23.0 as a hidden string in install.sh while floating the vllm-metal wheel on releases/latest - the two could drift apart silently. Make both a tracked, reproducible pair (VLLM_METAL_VERSION + VLLM_VERSION), fetch the wheel by tag, and add .github/bump_vllm_metal.sh wired into bump_deps.yaml. It tracks vllm-project/vllm-metal (not vllm/vllm latest), reading the coupled vLLM source version from vllm-metal's own installer, and opens a bump PR - mirroring the existing bump_vllm_wheel.sh for the cu130 wheel.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* chore(vllm): derive the darwin vLLM version, drop the second pin

Follow-up: VLLM_VERSION was still a hardcoded string duplicating what VLLM_METAL_VERSION already determines. Derive it at install time from vllm-metal's own installer (vllm_v=) at the pinned tag - one source of truth, no second value to drift. The bumper now touches only VLLM_METAL_VERSION; the derivation is immutable per tag, so builds stay reproducible.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* fix(vllm): fetch the vllm-metal wheel without the GitHub API

The darwin build resolved the wheel URL via api.github.com, whose unauthenticated rate limit (60/hr per IP) 403s on shared macOS runners (observed after the 9-min vLLM source build). Construct the release-asset download URL deterministically from the pinned tag and the cp312/arm64 wheel name instead - no API call, no rate limit. Verified the URL resolves (200).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* fix(vllm): fail Score cleanly when the engine returns no prompt_logprobs

Audit of the Score path against vllm-metal (MLX on macOS): the engine accepts SamplingParams(prompt_logprobs=1) but returns an all-None prompt_logprobs list rather than computing it, so scoring is not supported there. The old guard treated the truthy [None] list as valid and silently scored every candidate as 0. Detect the all-None case and return UNIMPLEMENTED instead. No-op on Linux/CUDA, which populate real entries.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
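The autobumper commit relies on the pin staying a single, plain double-quoted `VLLM_METAL_VERSION="..."` assignment that a line-oriented rewrite can find. The following is a minimal Python sketch of that kind of rewrite, not the contents of `.github/bump_vllm_metal.sh` (which is a shell script using sed); the helper name `bump_pin` and the tag `v0.3.1` are hypothetical.

```python
import re

# Matches one whole line of the form:   VLLM_METAL_VERSION="<anything>"
# [ \t]* is used instead of \s* so the match never spills across newlines.
PIN_LINE = re.compile(r'^([ \t]*VLLM_METAL_VERSION=)"[^"]*"[ \t]*$', re.MULTILINE)


def bump_pin(script_text: str, new_tag: str) -> str:
    """Rewrite the single VLLM_METAL_VERSION pin, failing loudly otherwise."""
    updated, count = PIN_LINE.subn(
        lambda m: f'{m.group(1)}"{new_tag}"', script_text
    )
    if count != 1:
        # Zero or several matches means the "one plain assignment" contract
        # was broken, so refuse to guess which line to rewrite.
        raise ValueError(f"expected exactly one pin line, found {count}")
    return updated


sample = 'pip install uv\n    VLLM_METAL_VERSION="v0.3.0.dev20260622062346"\nrunProtogen\n'
print(bump_pin(sample, "v0.3.1"))  # hypothetical new tag
```

Requiring exactly one match is a design choice for the sketch: it makes a reformatted or duplicated pin line fail the bump instead of producing a silently wrong rewrite.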
@@ -457,9 +457,14 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
 except Exception:
     pass

-if last_output is None or not getattr(last_output, "prompt_logprobs", None):
-    context.set_code(grpc.StatusCode.INTERNAL)
-    context.set_details("vLLM did not return prompt_logprobs")
+_pl = getattr(last_output, "prompt_logprobs", None) if last_output is not None else None
+# Some engines accept the prompt_logprobs request but return a
+# list of all-None entries instead of computing them (observed
+# with vllm-metal's MLX backend on macOS). Treat that as
+# unsupported rather than silently scoring every candidate as 0.
+if not _pl or all(e is None for e in _pl):
+    context.set_code(grpc.StatusCode.UNIMPLEMENTED)
+    context.set_details("This backend did not return prompt_logprobs; scoring is unsupported on this engine (e.g. vllm-metal / MLX on macOS).")
     return backend_pb2.ScoreResponse()

 prompt_logprobs = last_output.prompt_logprobs

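The bug behind this hunk is that a list containing only `None` is truthy in Python, so a plain truthiness test accepts it. A minimal sketch of the new check in isolation follows; `usable_prompt_logprobs` is a hypothetical helper name (backend.py inlines the condition), and the sample lists are made up for illustration.

```python
from typing import Optional, Sequence


def usable_prompt_logprobs(prompt_logprobs: Optional[Sequence[object]]) -> bool:
    """Return True only if at least one entry was actually computed."""
    # None or an empty list: the engine returned nothing at all.
    if not prompt_logprobs:
        return False
    # A list such as [None, None] is truthy, so it needs an explicit check.
    return not all(entry is None for entry in prompt_logprobs)


# The old guard only tested truthiness, which lets [None] through.
print(bool([None]))                                # True
print(usable_prompt_logprobs([None]))              # False
print(usable_prompt_logprobs([None, {42: -0.7}]))  # True
```

The last call shows why the check uses `all(...)` rather than rejecting any `None`: a list with some empty and some populated entries still counts as usable, which matches the commit's claim that the change is a no-op on engines that populate real entries.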
@@ -43,6 +43,24 @@ if [ "x${BUILD_PROFILE}" == "xcublas13" ]; then
     EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
 fi

+# Apple Silicon (Metal/MLX) via vllm-metal.
+# vllm-metal (github.com/vllm-project/vllm-metal) brings vLLM to macOS on Apple
+# Silicon: it registers through vLLM's platform-plugin entry point
+# (metal -> vllm_metal:register), MetalPlatform activates, and the vLLM v1
+# AsyncLLM engine runs on the GPU through MLX. LocalAI's backend.py is UNCHANGED
+# on darwin -- AsyncEngineArgs(...) -> AsyncLLMEngine.from_engine_args transparently
+# resolves to the MLX engine (proven on a real M4 / macOS 26.5 against Qwen3-0.6B).
+#
+# vllm-metal REQUIRES Python 3.12, so force the portable CPython before the venv
+# is created (ensureVenv reads PYTHON_VERSION/PYTHON_PATCH/PY_STANDALONE_TAG).
+# The patch + standalone tag mirror the l4t13 cp312 pin -- a known-good
+# python-build-standalone release that also ships an aarch64-apple-darwin asset.
+if [ "$(uname -s)" = "Darwin" ]; then
+    PYTHON_VERSION="3.12"
+    PYTHON_PATCH="12"
+    PY_STANDALONE_TAG="20251120"
+fi
+
 # JetPack 7 / L4T arm64 vllm + torch wheels come straight from PyPI now
 # (torch 2.11+ ships aarch64 + cu130 manylinux wheels and vllm 0.20+ ships
 # an aarch64 wheel pinned to that torch). They're cp312-only, so bump the
@@ -57,11 +75,87 @@ if [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
     PY_STANDALONE_TAG="20251120"
 fi

+# ===================== Apple Silicon (Metal/MLX) =====================
+# Reproduce vllm-metal's upstream installer
+# (curl -fsSL https://raw.githubusercontent.com/vllm-project/vllm-metal/main/install.sh)
+# but INTO LocalAI's managed venv (ensureVenv) instead of a throwaway
+# ~/.venv-vllm-metal, so the backend integrates with LocalAI's venv lifecycle
+# (portable CPython, _makeVenvPortable relocation, runtime activation). The
+# normal CUDA/CPU installRequirements is skipped on darwin -- there is no
+# macOS/arm64 vLLM wheel on PyPI; vLLM is built from source and the MLX engine
+# is layered on by the vllm-metal wheel.
+if [ "$(uname -s)" = "Darwin" ]; then
+    # Create/activate the portable 3.12 venv. On darwin USE_PIP=true and
+    # PORTABLE_PYTHON=true (set by scripts/build/python-darwin.sh), so this is a
+    # `python -m venv` based, relocatable venv.
+    ensureVenv
+
+    # vllm-metal's installer drives everything through `uv`: building vLLM from
+    # the CPU requirements needs `--index-strategy unsafe-best-match` (mixes the
+    # pytorch CPU channel with PyPI), a flag plain pip does not have. The darwin
+    # venv is pip-based, so bootstrap uv into it. uv honours $VIRTUAL_ENV (set by
+    # libbackend's _activateVenv) and installs into THIS venv -- same pattern the
+    # intel branch below relies on.
+    pip install uv
+
+    # The ONLY darwin version pin -- AUTO-BUMPED by .github/bump_vllm_metal.sh,
+    # which tracks vllm-project/vllm-metal releases (NOT vllm/vllm latest). Keep
+    # it as a plain double-quoted assignment on its own line so the bumper's sed
+    # can rewrite it. Darwin therefore follows vllm-metal and can lag the Linux
+    # vllm pin (requirements-cublas13-after.txt, bumped independently against
+    # vllm/vllm) until vllm-metal supports a newer vLLM.
+    VLLM_METAL_VERSION="v0.3.0.dev20260622062346"
+
+    # The coupled vLLM source version is whatever this vllm-metal release builds
+    # against -- it declares it in its own installer as `vllm_v=`. Derive it from
+    # the PINNED tag rather than hardcoding a second value that could drift. The
+    # tag is immutable, so this stays reproducible across rebuilds.
+    VLLM_VERSION=$(curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm-metal/${VLLM_METAL_VERSION}/install.sh" \
+        | grep -oE 'vllm_v="[0-9]+\.[0-9]+\.[0-9]+"' | head -n1 | cut -d'"' -f2)
+    if [ -z "${VLLM_VERSION}" ]; then
+        echo "ERROR: could not derive the vLLM version from vllm-metal ${VLLM_METAL_VERSION}" >&2
+        exit 1
+    fi
+    echo "vllm-metal ${VLLM_METAL_VERSION} builds against vLLM ${VLLM_VERSION}"
+
+    _vllm_src=$(mktemp -d)
+    trap 'rm -rf "${_vllm_src}"' EXIT
+    pushd "${_vllm_src}"
+    # 1) Build vLLM ${VLLM_VERSION} from the release source tarball against
+    #    the CPU requirements. vllm-metal layers its MLX platform plugin on
+    #    top of this exact build.
+    curl -fsSL -o "vllm-${VLLM_VERSION}.tar.gz" \
+        "https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}.tar.gz"
+    tar -xzf "vllm-${VLLM_VERSION}.tar.gz"
+    pushd "vllm-${VLLM_VERSION}"
+    uv pip install -r requirements/cpu.txt --index-strategy unsafe-best-match
+    # -Wno-parentheses: clang on macOS treats one of vLLM's C++ warnings
+    # as an error without it (matches the upstream installer's CXXFLAGS).
+    CXXFLAGS="-Wno-parentheses" uv pip install .
+    popd
+    popd
+
+    # 2) Install the prebuilt vllm-metal wheel for the PINNED release. It pulls
+    #    mlx / mlx-metal as deps and registers the `metal` platform plugin that
+    #    backend.py resolves to at engine-init time. Build the release-asset URL
+    #    deterministically (tag + the cp312/arm64 wheel name) rather than querying
+    #    api.github.com, whose unauthenticated rate limit (60/hr per IP) 403s on
+    #    shared CI runners. The wheel version is the tag without its leading 'v'.
+    _metal_wheel="vllm_metal-${VLLM_METAL_VERSION#v}-cp312-cp312-macosx_11_0_arm64.whl"
+    _metal_wheel_url="https://github.com/vllm-project/vllm-metal/releases/download/${VLLM_METAL_VERSION}/${_metal_wheel}"
+    echo "Installing vllm-metal wheel: ${_metal_wheel_url}"
+    uv pip install "${_metal_wheel_url}"
+
+    # Generate the gRPC stubs (backend_pb2*). installRequirements normally does
+    # this via runProtogen at the end; we skipped installRequirements on darwin,
+    # so call it explicitly here.
+    runProtogen
+
 # Intel XPU has no upstream-published vllm wheels, so we always build vllm
 # from source against torch-xpu and replace the default triton with
 # triton-xpu (matching torch 2.11). Mirrors the upstream procedure:
 # https://github.com/vllm-project/vllm/blob/main/docs/getting_started/installation/gpu.xpu.inc.md
-if [ "x${BUILD_TYPE}" == "xintel" ]; then
+elif [ "x${BUILD_TYPE}" == "xintel" ]; then
     # Hide requirements-intel-after.txt so installRequirements doesn't
     # try `pip install vllm` (would either fail or grab a non-XPU wheel).
     _intel_after="${backend_dir}/requirements-intel-after.txt"

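The darwin branch derives everything from one pinned tag: the vLLM version is scraped from the `vllm_v="X.Y.Z"` line of vllm-metal's installer, and both download URLs are built by string interpolation with no GitHub API call. The Python sketch below mirrors those three steps for illustration only. The function names are hypothetical, the installer text is a made-up sample rather than the real upstream file, and the code builds URLs without fetching anything.

```python
import re
from typing import Optional

# Same shape as the grep -oE pattern in install.sh: vllm_v="X.Y.Z"
VLLM_V_PATTERN = re.compile(r'vllm_v="([0-9]+\.[0-9]+\.[0-9]+)"')


def derive_vllm_version(installer_text: str) -> Optional[str]:
    """Return the first vllm_v="X.Y.Z" value, or None if absent."""
    match = VLLM_V_PATTERN.search(installer_text)
    return match.group(1) if match else None


def vllm_source_url(vllm_version: str) -> str:
    """Release source tarball URL, as constructed in install.sh."""
    return (
        "https://github.com/vllm-project/vllm/releases/download/"
        f"v{vllm_version}/vllm-{vllm_version}.tar.gz"
    )


def metal_wheel_url(metal_tag: str) -> str:
    """Release-asset URL for the cp312/arm64 wheel of a pinned tag."""
    # Equivalent of ${VLLM_METAL_VERSION#v}: drop one leading 'v' if present.
    wheel_version = metal_tag[1:] if metal_tag.startswith("v") else metal_tag
    wheel = f"vllm_metal-{wheel_version}-cp312-cp312-macosx_11_0_arm64.whl"
    return (
        "https://github.com/vllm-project/vllm-metal/releases/download/"
        f"{metal_tag}/{wheel}"
    )


# Made-up installer excerpt, used only to exercise the regex.
sample_installer = 'set -e\nvllm_v="0.23.0"\necho "building"\n'
version = derive_vllm_version(sample_installer)
print(version)  # 0.23.0
print(vllm_source_url(version))
print(metal_wheel_url("v0.3.0.dev20260622062346"))
```

Returning `None` when the pattern is missing corresponds to the empty-string check in the shell script, where a failed derivation aborts the install instead of building against an unknown version.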
@@ -4,4 +4,7 @@
 # instead -- the cublas13 case in install.sh adds --index-strategy=unsafe-best-match
 # so uv consults this index alongside PyPI.
 --extra-index-url https://wheels.vllm.ai/0.23.0/cu130
+# VERSION COUPLING: darwin/Apple-Silicon builds use vllm-metal (see install.sh),
+# which pins this exact vLLM version. Bumping vllm here means coordinating with a
+# vllm-metal release that supports the new version, or macOS/Metal builds break.
 vllm==0.23.0
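The coupling note in this requirements file can be checked mechanically by comparing the `vllm==` pin against the version that vllm-metal builds against. This is a minimal sketch assuming the requirements text is already in memory; `pinned_vllm_version` and `pins_coupled` are hypothetical helpers, not part of LocalAI, and the sample text is abbreviated.

```python
import re
from typing import Optional


def pinned_vllm_version(requirements_text: str) -> Optional[str]:
    """Return the version from a 'vllm==X.Y.Z' line, ignoring comments."""
    for raw_line in requirements_text.splitlines():
        # Strip trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        match = re.fullmatch(r"vllm==([0-9]+(?:\.[0-9]+)*)", line)
        if match:
            return match.group(1)
    return None


def pins_coupled(requirements_text: str, metal_vllm_version: str) -> bool:
    """True when the Linux pin equals the vLLM version vllm-metal builds against."""
    return pinned_vllm_version(requirements_text) == metal_vllm_version


sample_requirements = (
    "--extra-index-url https://wheels.vllm.ai/0.23.0/cu130\n"
    "# VERSION COUPLING: see install.sh\n"
    "vllm==0.23.0\n"
)
print(pinned_vllm_version(sample_requirements))     # 0.23.0
print(pins_coupled(sample_requirements, "0.23.0"))  # True
print(pins_coupled(sample_requirements, "0.24.0"))  # False
```

A mismatch is not necessarily an error: the install.sh comments state that darwin can lag the Linux pin until vllm-metal supports a newer vLLM, so a check like this is better suited to a warning than a hard failure.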