chore(vllm): derive the darwin vLLM version, drop the second pin

Follow-up: VLLM_VERSION was still a hardcoded string duplicating what VLLM_METAL_VERSION already determines. Derive it at install time from vllm-metal's own installer (vllm_v=) at the pinned tag - one source of truth, no second value to drift. The bumper now touches only VLLM_METAL_VERSION; the derivation is immutable per tag, so builds stay reproducible.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
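
The derivation described in this commit message can be sketched offline as below. This is a minimal illustration, not the shipped code: the helper name `derive_vllm_version` and the sample installer text (including its `1.2.3` version) are made up for the example, and the real install.sh feeds the pipeline from `curl` against the pinned tag rather than from a local string.

```shell
#!/bin/bash
# Hypothetical helper (not part of the change): read installer text on stdin
# and print the first vllm_v="X.Y.Z" value, using the same grep/head/cut
# pipeline that install.sh applies to vllm-metal's installer.
derive_vllm_version() {
    grep -oE 'vllm_v="[0-9]+\.[0-9]+\.[0-9]+"' | head -n1 | cut -d'"' -f2
}

# Made-up installer fragment, for illustration only.
sample_installer='#!/bin/bash
local vllm_v="1.2.3"
echo "installing"'

derived=$(printf '%s\n' "$sample_installer" | derive_vllm_version)

# Mirror install.sh's guard: an empty result means the pattern was not found.
if [ -z "$derived" ]; then
    echo "ERROR: could not derive the vLLM version" >&2
else
    echo "derived vLLM version: ${derived}"
fi
```

Because the pattern only accepts a strict three-part numeric version, an installer that changed the variable name or used a suffixed version would yield an empty string, which is why the guard matters.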
chore(vllm): track the darwin vllm-metal pin via the autobumper
2026-06-24 16:49:06 -04:00 · 2026-06-24 20:25:07 +00:00 · 2026-06-24 20:03:14 +00:00 · 2026-06-24 17:17:50 +00:00
8 changed files with 234 additions and 151 deletions
--- a/.github/backend-matrix.yml
+++ b/.github/backend-matrix.yml
@@ -4974,6 +4974,13 @@ includeDarwin:
  - backend: "kitten-tts"
    tag-suffix: "-metal-darwin-arm64-kitten-tts"
    build-type: "mps"
+  # vLLM on Apple Silicon via vllm-metal (MLX). The install is custom
+  # (backend/python/vllm/install.sh has a darwin branch); lang stays python so
+  # backend_build_darwin.yml drives it through build-darwin-python-backend ->
+  # scripts/build/python-darwin.sh, which runs the backend's install.sh.
+  - backend: "vllm"
+    tag-suffix: "-metal-darwin-arm64-vllm"
+    build-type: "mps"
  - backend: "piper"
    tag-suffix: "-metal-darwin-arm64-piper"
    build-type: "metal"
--- a/.github/bump_vllm_metal.sh
+++ b/.github/bump_vllm_metal.sh
@@ -0,0 +1,55 @@
+#!/bin/bash
+# Bump the single vllm-metal pin (VLLM_METAL_VERSION) in the vLLM backend's
+# darwin (Apple Silicon) install path. The macOS/Metal build
+# (backend/python/vllm/install.sh, Darwin branch) installs vllm-metal, which is
+# version-locked to a specific vLLM source release. install.sh derives that vLLM
+# version at build time from vllm-metal's own installer (`vllm_v=`) at the pinned
+# tag, so there is only ONE value to bump here -- mirroring bump_vllm_wheel.sh,
+# which bumps the Linux cu130 wheel pin.
+#
+# This deliberately tracks vllm-project/vllm-metal, NOT vllm-project/vllm: the
+# darwin build can only use the exact vLLM version vllm-metal supports, so it may
+# lag the Linux pin (requirements-cublas13-after.txt) until vllm-metal catches up.
+set -xe
+REPO=$1   # vllm-project/vllm-metal
+FILE=$2   # backend/python/vllm/install.sh
+VAR=$3    # VLLM_METAL_VERSION (used for the workflow's output file names)
+
+if [ -z "$FILE" ] || [ -z "$REPO" ] || [ -z "$VAR" ]; then
+    echo "usage: $0 <repo> <install-file> <var-name>" >&2
+    exit 1
+fi
+
+# vllm-metal ships frequent dev releases, all flagged as non-prerelease, so
+# /releases/latest returns the newest one (with its cp312 wheel asset).
+LATEST_TAG=$(curl -sS -H "Accept: application/vnd.github+json" \
+    "https://api.github.com/repos/$REPO/releases/latest" \
+    | python3 -c "import json,sys; print(json.load(sys.stdin)['tag_name'])")
+
+# The coupled vLLM source version lives in vllm-metal's installer at that tag.
+NEW_VLLM_VERSION=$(curl -fsSL \
+    "https://raw.githubusercontent.com/$REPO/$LATEST_TAG/install.sh" \
+    | grep -oE 'vllm_v="[0-9]+\.[0-9]+\.[0-9]+"' | head -1 | cut -d'"' -f2)
+
+if [ -z "$LATEST_TAG" ] || [ -z "$NEW_VLLM_VERSION" ]; then
+    echo "Could not resolve vllm-metal tag ($LATEST_TAG) or its vllm_v ($NEW_VLLM_VERSION)." >&2
+    exit 1
+fi
+
+set +e
+CURRENT_TAG=$(grep -oE 'VLLM_METAL_VERSION="[^"]*"' "$FILE" | head -1 | cut -d'"' -f2)
+set -e
+
+# Rewrite the single pin. install.sh derives VLLM_VERSION from this tag at build
+# time, so there is nothing else to touch. peter-evans/create-pull-request opens
+# no PR on a clean tree, so a no-op rewrite (already current) is safe.
+sed -i "$FILE" \
+    -e "s|VLLM_METAL_VERSION=\"[^\"]*\"|VLLM_METAL_VERSION=\"$LATEST_TAG\"|"
+
+if [ -z "$CURRENT_TAG" ]; then
+    echo "Could not find VLLM_METAL_VERSION=\"...\" in $FILE." >&2
+    exit 0
+fi
+
+echo "vllm-metal ${CURRENT_TAG} -> ${LATEST_TAG} (builds vLLM ${NEW_VLLM_VERSION}): https://github.com/$REPO/releases/tag/${LATEST_TAG}" >> "${VAR}_message.txt"
+echo "${LATEST_TAG}" >> "${VAR}_commit.txt"
--- a/.github/workflows/bump_deps.yaml
+++ b/.github/workflows/bump_deps.yaml
@@ -154,3 +154,39 @@ jobs:
          branch: "update/VLLM_VERSION"
          body: ${{ steps.bump.outputs.message }}
          signoff: true
+
+  bump-vllm-metal:
+    # The darwin (Apple Silicon) vLLM build installs vllm-metal, which is locked
+    # to a specific vLLM source release. install.sh pins only VLLM_METAL_VERSION
+    # and derives VLLM_VERSION from that tag at build time; this job tracks
+    # vllm-project/vllm-metal and rewrites the single pin. Separate from
+    # bump-vllm-wheel because darwin follows vllm-metal, not vllm/vllm latest.
+    if: github.repository == 'mudler/LocalAI'
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v7
+      - name: Bump vllm-metal pin 🔧
+        id: bump
+        run: |
+          bash .github/bump_vllm_metal.sh vllm-project/vllm-metal backend/python/vllm/install.sh VLLM_METAL_VERSION
+          {
+            echo 'message<<EOF'
+            cat "VLLM_METAL_VERSION_message.txt"
+            echo EOF
+          } >> "$GITHUB_OUTPUT"
+          {
+            echo 'commit<<EOF'
+            cat "VLLM_METAL_VERSION_commit.txt"
+            echo EOF
+          } >> "$GITHUB_OUTPUT"
+          rm -rfv VLLM_METAL_VERSION_message.txt VLLM_METAL_VERSION_commit.txt
+      - name: Create Pull Request
+        uses: peter-evans/create-pull-request@v8
+        with:
+          token: ${{ secrets.UPDATE_BOT_TOKEN }}
+          push-to-fork: ci-forks/LocalAI
+          commit-message: ':arrow_up: Update vllm-project/vllm-metal (darwin)'
+          title: 'chore: :arrow_up: Update vllm-metal (darwin) to `${{ steps.bump.outputs.commit }}`'
+          branch: "update/VLLM_METAL_VERSION"
+          body: ${{ steps.bump.outputs.message }}
+          signoff: true
--- a/backend/cpp/llama-cpp/grpc-server.cpp
+++ b/backend/cpp/llama-cpp/grpc-server.cpp
@@ -37,7 +37,6 @@
 #include "backend.pb.h"
 #include "backend.grpc.pb.h"
 #include "common.h"
-#include "arg.h"
 #include "chat-auto-parser.h"
 #include <getopt.h>
 #include <grpcpp/ext/proto_server_reflection_plugin.h>
@@ -593,10 +592,6 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
    params.checkpoint_min_step = 256;
 #endif

-    // Raw upstream llama-server flags collected from any option entry that
-    // starts with '-'. Applied once after the loop via common_params_parse.
-    std::vector<std::string> extra_argv;
-
     // decode options. Options are in form optname:optvalue, or if booleans only optname.
    for (int i = 0; i < request->options_size(); i++) {
        std::string opt = request->options(i);
@@ -1085,31 +1080,6 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
                } catch (...) {}
            }

-        // --- main model MoE on CPU (upstream --cpu-moe / --n-cpu-moe) ---
-        } else if (!strcmp(optname, "cpu_moe")) {
-            // Bool-style flag: keep all MoE expert weights on CPU.
-            const bool enable = (optval == NULL) ||
-                optval_str == "true" || optval_str == "1" || optval_str == "yes" ||
-                optval_str == "on" || optval_str == "enabled";
-            if (enable) {
-                params.tensor_buft_overrides.push_back(llm_ffn_exps_cpu_override());
-            }
-        } else if (!strcmp(optname, "n_cpu_moe")) {
-            if (optval != NULL) {
-                try {
-                    int n = std::stoi(optval_str);
-                    if (n < 0) n = 0;
-                    // Keep override-name storage alive for the lifetime of the
-                    // params struct (mirrors upstream arg.cpp's function-local static).
-                    static std::list<std::string> buft_overrides_main;
-                    for (int i = 0; i < n; ++i) {
-                        buft_overrides_main.push_back(llm_ffn_exps_block_regex(i));
-                        params.tensor_buft_overrides.push_back(
-                            {buft_overrides_main.back().c_str(), ggml_backend_cpu_buffer_type()});
-                    }
-                } catch (...) {}
-            }
-
        // --- draft model tensor buffer overrides (upstream --spec-draft-override-tensor) ---
        } else if (!strcmp(optname, "draft_override_tensor") || !strcmp(optname, "spec_draft_override_tensor")) {
            // Format: <tensor regex>=<buffer type>,<tensor regex>=<buffer type>,...
@@ -1141,30 +1111,6 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
                else { cur.push_back(c); }
            }
            if (!cur.empty()) flush(cur);
-
-        // --- generic passthrough: any entry starting with '-' is a raw
-        //     upstream llama-server flag, forwarded verbatim to the parser. ---
-        } else if (optname[0] == '-') {
-            std::string flag = optname;
-            // These flags make upstream's parser exit() (printing usage /
-            // completion), which would kill the backend process. Skip them.
-            if (flag == "-h" || flag == "--help" || flag == "--usage" ||
-                flag == "--version" || flag == "--license" ||
-                flag == "--list-devices" || flag == "-cl" ||
-                flag == "--cache-list" ||
-                flag.rfind("--completion", 0) == 0) {
-                fprintf(stderr,
-                    "[llama-cpp] ignoring passthrough flag that would exit: %s\n",
-                    flag.c_str());
-            } else {
-                extra_argv.push_back(flag);
-                // Preserve the whole value after the first ':' so embedded
-                // colons (e.g. host:port) survive strtok's truncation of optval.
-                auto colon = opt.find(':');
-                if (colon != std::string::npos) {
-                    extra_argv.push_back(opt.substr(colon + 1));
-                }
-            }
        }
    }

@@ -1200,6 +1146,27 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
        }
    }

+    if (!params.kv_overrides.empty()) {
+        params.kv_overrides.emplace_back();
+        params.kv_overrides.back().key[0] = 0;
+    }
+
+    // tensor_buft_overrides sentinel termination (mirrors upstream common/arg.cpp).
+    // Real entries are pushed during option parsing; here we pad/terminate so the
+    // model loader sees back().pattern == nullptr (GGML_ASSERT at common.cpp:1543)
+    // and so llama_params_fit has the placeholder slots it requires.
+    {
+        const size_t ntbo = llama_max_tensor_buft_overrides();
+        while (params.tensor_buft_overrides.size() < ntbo) {
+            params.tensor_buft_overrides.push_back({nullptr, nullptr});
+        }
+    }
+    // Terminate the draft tensor_buft_overrides list with a sentinel, mirroring
+    // the main-model handling above.
+    if (!params.speculative.draft.tensor_buft_overrides.empty()) {
+        params.speculative.draft.tensor_buft_overrides.push_back({nullptr, nullptr});
+    }
+
    // TODO: Add yarn

    if (!request->tensorsplit().empty()) {
@@ -1292,69 +1259,6 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
            params.sampling.grammar_triggers.push_back(std::move(trigger));
        }
    }
-
-    // Apply any raw upstream flags last so an explicit passthrough flag wins
-    // over the LocalAI-resolved field it maps to (e.g. --ctx-size beats
-    // context_size). This is the same parser llama-server itself uses.
-    if (!extra_argv.empty()) {
-        // common_params_parser_init resets a few fields for the SERVER example
-        // (n_parallel -> -1, use_color). Snapshot n_parallel so an unrelated
-        // passthrough flag can't silently clobber LocalAI's resolved value.
-        const int saved_n_parallel = params.n_parallel;
-
-        std::vector<char *> argv;
-        std::string prog = "llama-server";
-        argv.push_back(prog.data());
-        for (auto & a : extra_argv) {
-            argv.push_back(a.data());
-        }
-
-        // ctx_arg.params is a reference, so this overlays the given flags onto
-        // `params` in place. Returns false on a recoverable parse error (and
-        // self-restores params); may exit() on a hard error, exactly as
-        // passing the same bad flag to llama-server would.
-        if (!common_params_parse((int)argv.size(), argv.data(), params,
-                                 LLAMA_EXAMPLE_SERVER)) {
-            fprintf(stderr,
-                "[llama-cpp] failed to parse passthrough options; ignoring them\n");
-        }
-
-        // Restore n_parallel unless a passthrough flag explicitly set it
-        // (parser_init's reset sentinel for SERVER is -1).
-        if (params.n_parallel == -1) {
-            params.n_parallel = saved_n_parallel;
-        }
-    }
-
-    // Terminate/pad the override vectors only after BOTH the named-option loop
-    // and the generic passthrough (common_params_parse above) have pushed their
-    // real entries, so back() is the null sentinel the model loader asserts on.
-    // Running these before the passthrough let a passthrough flag (--cpu-moe,
-    // --override-tensor, --override-kv, ...) append a real entry after the
-    // sentinel: a GGML_ASSERT crash for tensor_buft_overrides, a silent drop for
-    // kv_overrides. Double-termination is harmless (the while is a no-op if the
-    // passthrough parse already padded; an extra trailing null is ignored).
-
-    if (!params.kv_overrides.empty()) {
-        params.kv_overrides.emplace_back();
-        params.kv_overrides.back().key[0] = 0;
-    }
-
-    // tensor_buft_overrides sentinel termination (mirrors upstream common/arg.cpp).
-    // Real entries are pushed during option parsing; here we pad/terminate so the
-    // model loader sees back().pattern == nullptr (GGML_ASSERT at common.cpp:1543)
-    // and so llama_params_fit has the placeholder slots it requires.
-    {
-        const size_t ntbo = llama_max_tensor_buft_overrides();
-        while (params.tensor_buft_overrides.size() < ntbo) {
-            params.tensor_buft_overrides.push_back({nullptr, nullptr});
-        }
-    }
-    // Terminate the draft tensor_buft_overrides list with a sentinel, mirroring
-    // the main-model handling above.
-    if (!params.speculative.draft.tensor_buft_overrides.empty()) {
-        params.speculative.draft.tensor_buft_overrides.push_back({nullptr, nullptr});
-    }
 }


--- a/backend/index.yaml
+++ b/backend/index.yaml
@@ -645,6 +645,7 @@
    nvidia-cuda-13: "cuda13-vllm"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-vllm"
    cpu: "cpu-vllm"
+    metal: "metal-vllm"
 - &sglang
  name: "sglang"
  license: apache-2.0
@@ -2927,6 +2928,17 @@
    nvidia-cuda-13: "cuda13-vllm-development"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-vllm-development"
    cpu: "cpu-vllm-development"
+    metal: "metal-vllm-development"
+- !!merge <<: *vllm
+  name: "metal-vllm"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-vllm"
+  mirrors:
+    - localai/localai-backends:latest-metal-darwin-arm64-vllm
+- !!merge <<: *vllm
+  name: "metal-vllm-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-vllm"
+  mirrors:
+    - localai/localai-backends:master-metal-darwin-arm64-vllm
 - !!merge <<: *vllm
  name: "cuda12-vllm"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-vllm"
--- a/backend/python/vllm/install.sh
+++ b/backend/python/vllm/install.sh
@@ -43,6 +43,24 @@ if [ "x${BUILD_PROFILE}" == "xcublas13" ]; then
    EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
 fi

+# Apple Silicon (Metal/MLX) via vllm-metal.
+# vllm-metal (github.com/vllm-project/vllm-metal) brings vLLM to macOS on Apple
+# Silicon: it registers through vLLM's platform-plugin entry point
+# (metal -> vllm_metal:register), MetalPlatform activates, and the vLLM v1
+# AsyncLLM engine runs on the GPU through MLX. LocalAI's backend.py is UNCHANGED
+# on darwin — AsyncEngineArgs(...) -> AsyncLLMEngine.from_engine_args transparently
+# resolves to the MLX engine (proven on a real M4 / macOS 26.5 against Qwen3-0.6B).
+#
+# vllm-metal REQUIRES Python 3.12, so force the portable CPython before the venv
+# is created (ensureVenv reads PYTHON_VERSION/PYTHON_PATCH/PY_STANDALONE_TAG).
+# The patch + standalone tag mirror the l4t13 cp312 pin — a known-good
+# python-build-standalone release that also ships an aarch64-apple-darwin asset.
+if [ "$(uname -s)" = "Darwin" ]; then
+    PYTHON_VERSION="3.12"
+    PYTHON_PATCH="12"
+    PY_STANDALONE_TAG="20251120"
+fi
+
 # JetPack 7 / L4T arm64 vllm + torch wheels come straight from PyPI now
 # (torch 2.11+ ships aarch64 + cu130 manylinux wheels and vllm 0.20+ ships
 # an aarch64 wheel pinned to that torch). They're cp312-only, so bump the
@@ -57,11 +75,92 @@ if [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
    PY_STANDALONE_TAG="20251120"
 fi

+# ===================== Apple Silicon (Metal/MLX) =====================
+# Reproduce vllm-metal's upstream installer
+# (curl -fsSL https://raw.githubusercontent.com/vllm-project/vllm-metal/main/install.sh)
+# but INTO LocalAI's managed venv (ensureVenv) instead of a throwaway
+# ~/.venv-vllm-metal, so the backend integrates with LocalAI's venv lifecycle
+# (portable CPython, _makeVenvPortable relocation, runtime activation). The
+# normal CUDA/CPU installRequirements is skipped on darwin — there is no
+# macOS/arm64 vLLM wheel on PyPI; vLLM is built from source and the MLX engine
+# is layered on by the vllm-metal wheel.
+if [ "$(uname -s)" = "Darwin" ]; then
+    # Create/activate the portable 3.12 venv. On darwin USE_PIP=true and
+    # PORTABLE_PYTHON=true (set by scripts/build/python-darwin.sh), so this is a
+    # `python -m venv` based, relocatable venv.
+    ensureVenv
+
+    # vllm-metal's installer drives everything through `uv`: building vLLM from
+    # the CPU requirements needs `--index-strategy unsafe-best-match` (mixes the
+    # pytorch CPU channel with PyPI), a flag plain pip does not have. The darwin
+    # venv is pip-based, so bootstrap uv into it. uv honours $VIRTUAL_ENV (set by
+    # libbackend's _activateVenv) and installs into THIS venv — same pattern the
+    # intel branch below relies on.
+    pip install uv
+
+    # The ONLY darwin version pin -- AUTO-BUMPED by .github/bump_vllm_metal.sh,
+    # which tracks vllm-project/vllm-metal releases (NOT vllm/vllm latest). Keep
+    # it as a plain double-quoted assignment on its own line so the bumper's sed
+    # can rewrite it. Darwin therefore follows vllm-metal and can lag the Linux
+    # vllm pin (requirements-cublas13-after.txt, bumped independently against
+    # vllm/vllm) until vllm-metal supports a newer vLLM.
+    VLLM_METAL_VERSION="v0.3.0.dev20260622062346"
+
+    # The coupled vLLM source version is whatever this vllm-metal release builds
+    # against -- it declares it in its own installer as `vllm_v=`. Derive it from
+    # the PINNED tag rather than hardcoding a second value that could drift. The
+    # tag is immutable, so this stays reproducible across rebuilds.
+    VLLM_VERSION=$(curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm-metal/${VLLM_METAL_VERSION}/install.sh" \
+        | grep -oE 'vllm_v="[0-9]+\.[0-9]+\.[0-9]+"' | head -n1 | cut -d'"' -f2)
+    if [ -z "${VLLM_VERSION}" ]; then
+        echo "ERROR: could not derive the vLLM version from vllm-metal ${VLLM_METAL_VERSION}" >&2
+        exit 1
+    fi
+    echo "vllm-metal ${VLLM_METAL_VERSION} builds against vLLM ${VLLM_VERSION}"
+
+    _vllm_src=$(mktemp -d)
+    trap 'rm -rf "${_vllm_src}"' EXIT
+    pushd "${_vllm_src}"
+        # 1) Build vLLM ${VLLM_VERSION} from the release source tarball against
+        #    the CPU requirements. vllm-metal layers its MLX platform plugin on
+        #    top of this exact build.
+        curl -fsSL -o "vllm-${VLLM_VERSION}.tar.gz" \
+            "https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}.tar.gz"
+        tar -xzf "vllm-${VLLM_VERSION}.tar.gz"
+        pushd "vllm-${VLLM_VERSION}"
+            uv pip install -r requirements/cpu.txt --index-strategy unsafe-best-match
+            # -Wno-parentheses: clang on macOS treats one of vLLM's C++ warnings
+            # as an error without it (matches the upstream installer's CXXFLAGS).
+            CXXFLAGS="-Wno-parentheses" uv pip install .
+        popd
+    popd
+
+    # 2) Install the prebuilt vllm-metal wheel from the PINNED release
+    #    (${VLLM_METAL_VERSION}). It pulls mlx / mlx-metal as deps and registers
+    #    the `metal` platform plugin that backend.py resolves to at engine-init
+    #    time. Pinning the tag (vs releases/latest) keeps the wheel and the vLLM
+    #    source build above reproducible and coupled; .github/bump_vllm_metal.sh
+    #    bumps the tag, and the derived vLLM version follows it.
+    _metal_wheel_url=$(curl -fsSL "https://api.github.com/repos/vllm-project/vllm-metal/releases/tags/${VLLM_METAL_VERSION}" \
+        | grep -oE '"browser_download_url"[[:space:]]*:[[:space:]]*"[^"]+\.whl"' \
+        | head -n1 | sed -E 's/.*"(https[^"]+)".*/\1/')
+    if [ -z "${_metal_wheel_url}" ]; then
+        echo "ERROR: could not resolve a vllm-metal wheel URL for release ${VLLM_METAL_VERSION}" >&2
+        exit 1
+    fi
+    echo "Installing vllm-metal wheel: ${_metal_wheel_url}"
+    uv pip install "${_metal_wheel_url}"
+
+    # Generate the gRPC stubs (backend_pb2*). installRequirements normally does
+    # this via runProtogen at the end; we skipped installRequirements on darwin,
+    # so call it explicitly here.
+    runProtogen
+
 # Intel XPU has no upstream-published vllm wheels, so we always build vllm
 # from source against torch-xpu and replace the default triton with
 # triton-xpu (matching torch 2.11). Mirrors the upstream procedure:
 # https://github.com/vllm-project/vllm/blob/main/docs/getting_started/installation/gpu.xpu.inc.md
-if [ "x${BUILD_TYPE}" == "xintel" ]; then
+elif [ "x${BUILD_TYPE}" == "xintel" ]; then
    # Hide requirements-intel-after.txt so installRequirements doesn't
    # try `pip install vllm` (would either fail or grab a non-XPU wheel).
    _intel_after="${backend_dir}/requirements-intel-after.txt"
--- a/backend/python/vllm/requirements-cublas13-after.txt
+++ b/backend/python/vllm/requirements-cublas13-after.txt
@@ -4,4 +4,7 @@
 # instead — the cublas13 case in install.sh adds --index-strategy=unsafe-best-match
 # so uv consults this index alongside PyPI.
 --extra-index-url https://wheels.vllm.ai/0.23.0/cu130
+# VERSION COUPLING: darwin/Apple-Silicon builds do NOT read this pin. They use
+# vllm-metal (see install.sh), which derives its own vLLM version from the pinned
+# vllm-metal tag, so macOS/Metal may lag this file until vllm-metal catches up.
 vllm==0.23.0
--- a/docs/content/advanced/model-configuration.md
+++ b/docs/content/advanced/model-configuration.md
@@ -494,39 +494,6 @@ These llama.cpp options are passed through the `options:` array.
 | `direct_io` / `use_direct_io` | bool | `false` | Open the model with `O_DIRECT` (faster cold loads on NVMe; ignored if not supported). |
 | `verbosity` | int | `3` | llama.cpp internal log verbosity threshold. Higher = more verbose. |
 | `override_tensor` / `tensor_buft_overrides` | string | "" | Per-tensor buffer-type overrides for the main model. Format: `<tensor regex>=<buffer type>,<tensor regex>=<buffer type>,...`. Mirrors the existing `draft_override_tensor` syntax for the draft model. |
-| `cpu_moe` | bool | false | Keep all MoE expert weights of the main model on CPU (upstream `--cpu-moe`). Frees VRAM on large MoE models (DeepSeek, Qwen3 `*-A3B`). |
-| `n_cpu_moe` | int | 0 | Keep MoE expert weights of the first N main-model layers on CPU (upstream `--n-cpu-moe`). |
-
-#### Generic option passthrough
-
-Any `options:` entry whose name starts with `-` is forwarded **verbatim** to
-upstream llama.cpp's own `llama-server` argument parser. This means any flag the
-bundled llama.cpp supports works without LocalAI needing a dedicated option,
-even ones added after your LocalAI version was built. See the upstream
-[server flags reference](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md).
-
-Format mirrors the rest of the array - `--flag` for a boolean, or `--flag:value`
-for a flag that takes a value. Everything after the first `:` is the value, so
-embedded colons (e.g. `host:port`) are preserved:
-
-```yaml
-options:
-  - "--cpu-moe"                 # boolean flag
-  - "--n-cpu-moe:4"             # flag with a value
-  - "--override-tensor:exps=CPU"
-```
-
-Notes:
-
-- **Precedence:** passthrough flags are applied last, so an explicit flag
-  overrides the LocalAI option it maps to (e.g. `--ctx-size:8192` overrides
-  `context_size`).
-- **Power-user territory:** an invalid flag or value is rejected by the upstream
-  parser exactly as it would be by `llama-server`, which can fail model loading.
-  Prefer the named options above when one exists.
-- Flags that would terminate the process (such as `--help`, `--usage`,
-  `--version`, `--license`, `--list-devices`, `--cache-list`, and
-  `--completion*`) are ignored.

 ### Prompt Caching