feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map (#9563)

* feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map

LocalAI's vLLM backend wraps a small typed subset of vLLM's
AsyncEngineArgs (quantization, tensor_parallel_size, dtype, etc.).
Anything outside that subset -- pipeline/data/expert parallelism,
speculative_config, kv_transfer_config, all2all_backend, prefix
caching, chunked prefill, etc. -- requires a new protobuf field, a
Go struct field, an options.go line, and a backend.py mapping per
feature. That cadence is the bottleneck on shipping vLLM's
production feature set.

Add a generic `engine_args:` map on the model YAML that is
JSON-serialised into a new ModelOptions.EngineArgs proto field and
applied verbatim to AsyncEngineArgs at LoadModel time. Validation
is done by the Python backend via dataclasses.fields(); unknown
keys fail with the closest valid name as a hint.
dataclasses.replace() is used so vLLM's __post_init__ re-runs and
auto-converts dict values into nested config dataclasses
(CompilationConfig, AttentionConfig, ...). speculative_config and
kv_transfer_config flow through as dicts; vLLM converts them at
engine init.

Operators can now write:

  engine_args:
    data_parallel_size: 8
    enable_expert_parallel: true
    all2all_backend: deepep_low_latency
    speculative_config:
      method: deepseek_mtp
      num_speculative_tokens: 3
    kv_cache_dtype: fp8

without further proto/Go/Python plumbing per field.

Production defaults seeded by hooks_vllm.go: enable_prefix_caching
and enable_chunked_prefill default to true unless explicitly set.
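The seeding behaviour described above can be sketched in Python (hooks_vllm.go does this in Go; `seed_defaults` and the dict shape are illustrative, not the actual implementation):

```python
# Sketch: defaults apply only when the operator hasn't set the key,
# so an explicit `enable_prefix_caching: false` is never overridden.
def seed_defaults(engine_args: dict) -> dict:
    out = dict(engine_args)
    for key in ("enable_prefix_caching", "enable_chunked_prefill"):
        out.setdefault(key, True)
    return out

assert seed_defaults({})["enable_prefix_caching"] is True
assert seed_defaults({"enable_chunked_prefill": False})["enable_chunked_prefill"] is False
```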

Existing typed YAML fields (gpu_memory_utilization,
tensor_parallel_size, etc.) remain for back-compat; engine_args
overrides them when both are set.
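The validate-then-replace mechanism can be sketched in plain Python with illustrative dataclasses (`EngineArgs` / `SpecConfig` here stand in for vLLM's real classes):

```python
import dataclasses
import difflib

@dataclasses.dataclass
class SpecConfig:
    method: str = ""
    num_speculative_tokens: int = 0

@dataclasses.dataclass
class EngineArgs:
    model: str = ""
    speculative_config: object = None

    def __post_init__(self):
        # dataclasses.replace() constructs a new instance, so this re-runs
        # and auto-converts dict values into nested config dataclasses,
        # mirroring what vLLM does for compilation_config and friends.
        if isinstance(self.speculative_config, dict):
            self.speculative_config = SpecConfig(**self.speculative_config)

def apply_engine_args(base, extra: dict):
    # Validate against the dataclass's own field names; suggest the
    # closest valid name for typos.
    valid = {f.name for f in dataclasses.fields(type(base))}
    for key in extra:
        if key not in valid:
            match = difflib.get_close_matches(key, valid, n=1)
            hint = f" did you mean {match[0]!r}?" if match else ""
            raise ValueError(f"unknown engine_args key {key!r}.{hint}")
    return dataclasses.replace(base, **extra)

out = apply_engine_args(
    EngineArgs(model="m"),
    {"speculative_config": {"method": "deepseek_mtp", "num_speculative_tokens": 3}},
)
assert isinstance(out.speculative_config, SpecConfig)
```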

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* chore(vllm): pin cublas13 to vLLM 0.20.0 cu130 wheel

vLLM's PyPI wheel is built against CUDA 12 (libcudart.so.12) and won't
load on a cu130 host. Switch the cublas13 build to vLLM's per-tag cu130
simple-index (https://wheels.vllm.ai/0.20.0/cu130/) and pin
vllm==0.20.0. The cu130-flavoured wheel ships libcudart.so.13 and
includes the DFlash speculative-decoding method that landed in 0.20.0.

cublas13 install gets --index-strategy=unsafe-best-match so uv consults
both the cu130 index and PyPI when resolving — PyPI also publishes
vllm==0.20.0, but with cu12 binaries that error at import time.

Verified: Qwen3.5-4B + z-lab/Qwen3.5-4B-DFlash loads and serves chat
completions on RTX 5070 Ti (sm_120, cu130).

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* ci(vllm): bot job to bump cublas13 vLLM wheel pin

vLLM's cu130 wheel index URL is itself version-locked
(wheels.vllm.ai/<TAG>/cu130/, no /latest/ alias upstream), so a vLLM
bump means rewriting two values atomically — the URL segment and the
version constraint. bump_deps.sh handles git-sha-in-Makefile only;
add a sibling bump_vllm_wheel.sh and a matching workflow job that
mirrors the existing matrix's PR-creation pattern.

The bumper queries /releases/latest (which excludes prereleases),
strips the leading 'v', and seds both lines unconditionally. When the
file is already on the latest tag the rewrite is a no-op and
peter-evans/create-pull-request opens no PR.
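The bumper's two-value rewrite can be sketched in Python (the real bump_vllm_wheel.sh uses sed; `latest_tag` and `rewrite` and their regexes are illustrative):

```python
import json
import re
import urllib.request

def latest_tag(repo="vllm-project/vllm"):
    # /releases/latest excludes prereleases; tags carry a leading 'v'.
    url = f"https://api.github.com/repos/{repo}/releases/latest"
    with urllib.request.urlopen(url) as r:
        return json.load(r)["tag_name"].removeprefix("v")

def rewrite(text, ver):
    # Rewrite both values atomically: the per-tag index URL segment
    # and the version constraint. A no-op when already on the latest tag.
    text = re.sub(r"wheels\.vllm\.ai/[^/]+/cu130",
                  f"wheels.vllm.ai/{ver}/cu130", text)
    return re.sub(r"vllm==\S+", f"vllm=={ver}", text)
```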

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* docs(vllm): document engine_args and speculative decoding

The new engine_args: map plumbs arbitrary AsyncEngineArgs through to
vLLM, but the public docs only covered the basic typed fields. Add a
short subsection in the vLLM section explaining the typed/generic
split and showing a worked DFlash speculative-decoding config, with
pointers to vLLM's SpeculativeConfig reference and z-lab's drafter
collection.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
Author: Richard Palethorpe
Date: 2026-04-28 23:49:28 +01:00 (committed by GitHub)
Parent: 55afda22e3
Commit: 4916f8c880
14 changed files with 394 additions and 6 deletions

View File

@@ -1,5 +1,7 @@
#!/usr/bin/env python3
import asyncio
import dataclasses
import difflib
from concurrent import futures
import argparse
import signal
@@ -101,6 +103,36 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
            opts[key.strip()] = value.strip()
        return opts

    def _apply_engine_args(self, engine_args, engine_args_json):
        """Apply user-supplied engine_args (JSON object) onto an AsyncEngineArgs.

        Returns a new AsyncEngineArgs with the typed fields preserved and the
        user's overrides layered on top. Uses ``dataclasses.replace`` so vLLM's
        ``__post_init__`` re-runs and auto-converts dict-valued fields like
        ``compilation_config`` / ``attention_config`` into their dataclass form.
        ``speculative_config`` and ``kv_transfer_config`` are accepted as dicts
        directly (vLLM converts them at engine init).

        Unknown keys raise ValueError with the closest valid field as a hint.
        """
        if not engine_args_json:
            return engine_args
        try:
            extra = json.loads(engine_args_json)
        except json.JSONDecodeError as e:
            raise ValueError(f"engine_args is not valid JSON: {e}") from e
        if not isinstance(extra, dict):
            raise ValueError(
                f"engine_args must be a JSON object, got {type(extra).__name__}"
            )
        valid = {f.name for f in dataclasses.fields(type(engine_args))}
        for key in extra:
            if key not in valid:
                suggestion = difflib.get_close_matches(key, valid, n=1)
                hint = f" did you mean {suggestion[0]!r}?" if suggestion else ""
                raise ValueError(f"unknown engine_args key {key!r}.{hint}")
        return dataclasses.replace(engine_args, **extra)

    def _messages_to_dicts(self, messages):
        """Convert proto Messages to list of dicts suitable for apply_chat_template()."""
        result = []
@@ -176,6 +208,15 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
"audio": max(request.LimitAudioPerPrompt, 1)
}
# engine_args from YAML overrides typed fields above so operators can
# tune anything the AsyncEngineArgs dataclass exposes without waiting
# on protobuf changes.
try:
engine_args = self._apply_engine_args(engine_args, request.EngineArgs)
except ValueError as err:
print(f"engine_args error: {err}", file=sys.stderr)
return backend_pb2.Result(success=False, message=str(err))
try:
self.llm = AsyncLLMEngine.from_engine_args(engine_args)
except Exception as err:

View File

@@ -32,6 +32,14 @@ if [ "x${BUILD_PROFILE}" == "xcpu" ]; then
    EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi

# cublas13 pulls the vLLM wheel from a per-tag cu130 index (PyPI's vllm wheel
# is built against CUDA 12 and won't load on cu130). uv's default per-package
# first-match strategy would still pick the PyPI wheel, so allow it to consult
# every configured index when resolving.
if [ "x${BUILD_PROFILE}" == "xcublas13" ]; then
    EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi
# JetPack 7 / L4T arm64 wheels (torch, vllm, flash-attn) live on
# pypi.jetson-ai-lab.io and are built for cp312, so bump the venv Python
# accordingly. JetPack 6 keeps cp310 + USE_PIP=true. unsafe-best-match

View File

@@ -1,2 +1,7 @@
--extra-index-url https://download.pytorch.org/whl/cu130
vllm
# vLLM's PyPI wheel is built against CUDA 12 (libcudart.so.12) and won't load
# on a cu130 host. Pull the cu130-flavoured wheel from vLLM's per-tag index
# instead — the cublas13 case in install.sh adds --index-strategy=unsafe-best-match
# so uv consults this index alongside PyPI.
--extra-index-url https://wheels.vllm.ai/0.20.0/cu130
vllm==0.20.0

View File

@@ -168,6 +168,58 @@ class TestBackendServicer(unittest.TestCase):
self.assertEqual(opts["key_with_colons"], "a:b:c")
self.assertNotIn("invalid_no_colon", opts)
def test_apply_engine_args_known_keys(self):
"""
Tests _apply_engine_args overlays user-supplied JSON onto AsyncEngineArgs.
"""
import sys, os, json as _json
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from backend import BackendServicer
from vllm.engine.arg_utils import AsyncEngineArgs
servicer = BackendServicer()
base = AsyncEngineArgs(model="facebook/opt-125m")
extras = _json.dumps({
"trust_remote_code": True,
"max_num_seqs": 32,
})
out = servicer._apply_engine_args(base, extras)
self.assertTrue(out.trust_remote_code)
self.assertEqual(out.max_num_seqs, 32)
# untouched fields preserved
self.assertEqual(out.model, "facebook/opt-125m")
def test_apply_engine_args_unknown_key_raises(self):
"""
Tests _apply_engine_args rejects unknown keys with a helpful suggestion.
"""
import sys, os, json as _json
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from backend import BackendServicer
from vllm.engine.arg_utils import AsyncEngineArgs
servicer = BackendServicer()
base = AsyncEngineArgs(model="facebook/opt-125m")
with self.assertRaises(ValueError) as ctx:
servicer._apply_engine_args(base, _json.dumps({"trustremotecode": True}))
self.assertIn("trustremotecode", str(ctx.exception))
# close-match hint for the typo
self.assertIn("trust_remote_code", str(ctx.exception))
def test_apply_engine_args_empty_passthrough(self):
"""
Tests that empty engine_args returns the base unchanged.
"""
import sys, os
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from backend import BackendServicer
from vllm.engine.arg_utils import AsyncEngineArgs
servicer = BackendServicer()
base = AsyncEngineArgs(model="facebook/opt-125m")
self.assertIs(servicer._apply_engine_args(base, ""), base)
self.assertIs(servicer._apply_engine_args(base, None), base)
def test_tokenize_string(self):
"""
Tests the TokenizeString RPC returns valid tokens.