feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map (#9563)

* feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map

LocalAI's vLLM backend wraps a small typed subset of vLLM's
AsyncEngineArgs (quantization, tensor_parallel_size, dtype, etc.).
Anything outside that subset -- pipeline/data/expert parallelism,
speculative_config, kv_transfer_config, all2all_backend, prefix
caching, chunked prefill, etc. -- requires a new protobuf field, a
Go struct field, an options.go line, and a backend.py mapping per
feature. That cadence is the bottleneck on shipping vLLM's
production feature set.

Add a generic `engine_args:` map on the model YAML that is
JSON-serialised into a new ModelOptions.EngineArgs proto field and
applied verbatim to AsyncEngineArgs at LoadModel time. Validation
is done by the Python backend via dataclasses.fields(); unknown
keys fail with the closest valid name as a hint.
dataclasses.replace() is used so vLLM's __post_init__ re-runs and
auto-converts dict values into nested config dataclasses
(CompilationConfig, AttentionConfig, ...). speculative_config and
kv_transfer_config flow through as dicts; vLLM converts them at
engine init.

Operators can now write:

  engine_args:
    data_parallel_size: 8
    enable_expert_parallel: true
    all2all_backend: deepep_low_latency
    speculative_config:
      method: deepseek_mtp
      num_speculative_tokens: 3
    kv_cache_dtype: fp8

without further proto/Go/Python plumbing per field.

Production defaults seeded by hooks_vllm.go: enable_prefix_caching
and enable_chunked_prefill default to true unless explicitly set.
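The seeding behaviour described above can be sketched in Python (hooks_vllm.go does this in Go; `seed_defaults` and the dict shape are illustrative, not the actual implementation):

```python
# Sketch: defaults apply only when the operator hasn't set the key,
# so an explicit `enable_prefix_caching: false` is never overridden.
def seed_defaults(engine_args: dict) -> dict:
    out = dict(engine_args)
    for key in ("enable_prefix_caching", "enable_chunked_prefill"):
        out.setdefault(key, True)
    return out

assert seed_defaults({})["enable_prefix_caching"] is True
assert seed_defaults({"enable_chunked_prefill": False})["enable_chunked_prefill"] is False
```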

Existing typed YAML fields (gpu_memory_utilization,
tensor_parallel_size, etc.) remain for back-compat; engine_args
overrides them when both are set.
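The validate-then-replace mechanism can be sketched in plain Python with illustrative dataclasses (`EngineArgs` / `SpecConfig` here stand in for vLLM's real classes):

```python
import dataclasses
import difflib

@dataclasses.dataclass
class SpecConfig:
    method: str = ""
    num_speculative_tokens: int = 0

@dataclasses.dataclass
class EngineArgs:
    model: str = ""
    speculative_config: object = None

    def __post_init__(self):
        # dataclasses.replace() constructs a new instance, so this re-runs
        # and auto-converts dict values into nested config dataclasses,
        # mirroring what vLLM does for compilation_config and friends.
        if isinstance(self.speculative_config, dict):
            self.speculative_config = SpecConfig(**self.speculative_config)

def apply_engine_args(base, extra: dict):
    # Validate against the dataclass's own field names; suggest the
    # closest valid name for typos.
    valid = {f.name for f in dataclasses.fields(type(base))}
    for key in extra:
        if key not in valid:
            match = difflib.get_close_matches(key, valid, n=1)
            hint = f" did you mean {match[0]!r}?" if match else ""
            raise ValueError(f"unknown engine_args key {key!r}.{hint}")
    return dataclasses.replace(base, **extra)

out = apply_engine_args(
    EngineArgs(model="m"),
    {"speculative_config": {"method": "deepseek_mtp", "num_speculative_tokens": 3}},
)
assert isinstance(out.speculative_config, SpecConfig)
```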

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* chore(vllm): pin cublas13 to vLLM 0.20.0 cu130 wheel

vLLM's PyPI wheel is built against CUDA 12 (libcudart.so.12) and won't
load on a cu130 host. Switch the cublas13 build to vLLM's per-tag cu130
simple-index (https://wheels.vllm.ai/0.20.0/cu130/) and pin
vllm==0.20.0. The cu130-flavoured wheel ships libcudart.so.13 and
includes the DFlash speculative-decoding method that landed in 0.20.0.

cublas13 install gets --index-strategy=unsafe-best-match so uv consults
both the cu130 index and PyPI when resolving — PyPI also publishes
vllm==0.20.0, but with cu12 binaries that error at import time.

Verified: Qwen3.5-4B + z-lab/Qwen3.5-4B-DFlash loads and serves chat
completions on RTX 5070 Ti (sm_120, cu130).

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* ci(vllm): bot job to bump cublas13 vLLM wheel pin

vLLM's cu130 wheel index URL is itself version-locked
(wheels.vllm.ai/<TAG>/cu130/, no /latest/ alias upstream), so a vLLM
bump means rewriting two values atomically — the URL segment and the
version constraint. bump_deps.sh handles git-sha-in-Makefile only;
add a sibling bump_vllm_wheel.sh and a matching workflow job that
mirrors the existing matrix's PR-creation pattern.

The bumper queries /releases/latest (which excludes prereleases),
strips the leading 'v', and seds both lines unconditionally. When the
file is already on the latest tag the rewrite is a no-op and
peter-evans/create-pull-request opens no PR.
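The bumper's two-value rewrite can be sketched in Python (the real bump_vllm_wheel.sh uses sed; `latest_tag` and `rewrite` and their regexes are illustrative):

```python
import json
import re
import urllib.request

def latest_tag(repo="vllm-project/vllm"):
    # /releases/latest excludes prereleases; tags carry a leading 'v'.
    url = f"https://api.github.com/repos/{repo}/releases/latest"
    with urllib.request.urlopen(url) as r:
        return json.load(r)["tag_name"].removeprefix("v")

def rewrite(text, ver):
    # Rewrite both values atomically: the per-tag index URL segment
    # and the version constraint. A no-op when already on the latest tag.
    text = re.sub(r"wheels\.vllm\.ai/[^/]+/cu130",
                  f"wheels.vllm.ai/{ver}/cu130", text)
    return re.sub(r"vllm==\S+", f"vllm=={ver}", text)
```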

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* docs(vllm): document engine_args and speculative decoding

The new engine_args: map plumbs arbitrary AsyncEngineArgs through to
vLLM, but the public docs only covered the basic typed fields. Add a
short subsection in the vLLM section explaining the typed/generic
split and showing a worked DFlash speculative-decoding config, with
pointers to vLLM's SpeculativeConfig reference and z-lab's drafter
collection.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
Author: Richard Palethorpe
Date: 2026-04-28 23:49:28 +01:00 (committed by GitHub)
Parent: 55afda22e3
Commit: 4916f8c880
14 changed files with 394 additions and 6 deletions

View File

@@ -1,5 +1,7 @@
#!/usr/bin/env python3
import asyncio
import dataclasses
import difflib
from concurrent import futures
import argparse
import signal
@@ -101,6 +103,36 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
            opts[key.strip()] = value.strip()
        return opts

    def _apply_engine_args(self, engine_args, engine_args_json):
        """Apply user-supplied engine_args (JSON object) onto an AsyncEngineArgs.

        Returns a new AsyncEngineArgs with the typed fields preserved and the
        user's overrides layered on top. Uses ``dataclasses.replace`` so vLLM's
        ``__post_init__`` re-runs and auto-converts dict-valued fields like
        ``compilation_config`` / ``attention_config`` into their dataclass form.
        ``speculative_config`` and ``kv_transfer_config`` are accepted as dicts
        directly (vLLM converts them at engine init).

        Unknown keys raise ValueError with the closest valid field as a hint.
        """
        if not engine_args_json:
            return engine_args
        try:
            extra = json.loads(engine_args_json)
        except json.JSONDecodeError as e:
            raise ValueError(f"engine_args is not valid JSON: {e}") from e
        if not isinstance(extra, dict):
            raise ValueError(
                f"engine_args must be a JSON object, got {type(extra).__name__}"
            )
        valid = {f.name for f in dataclasses.fields(type(engine_args))}
        for key in extra:
            if key not in valid:
                suggestion = difflib.get_close_matches(key, valid, n=1)
                hint = f" did you mean {suggestion[0]!r}?" if suggestion else ""
                raise ValueError(f"unknown engine_args key {key!r}.{hint}")
        return dataclasses.replace(engine_args, **extra)

    def _messages_to_dicts(self, messages):
        """Convert proto Messages to list of dicts suitable for apply_chat_template()."""
        result = []
@@ -176,6 +208,15 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
"audio": max(request.LimitAudioPerPrompt, 1)
}
# engine_args from YAML overrides typed fields above so operators can
# tune anything the AsyncEngineArgs dataclass exposes without waiting
# on protobuf changes.
try:
engine_args = self._apply_engine_args(engine_args, request.EngineArgs)
except ValueError as err:
print(f"engine_args error: {err}", file=sys.stderr)
return backend_pb2.Result(success=False, message=str(err))
try:
self.llm = AsyncLLMEngine.from_engine_args(engine_args)
except Exception as err:

View File

@@ -32,6 +32,14 @@ if [ "x${BUILD_PROFILE}" == "xcpu" ]; then
    EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi

# cublas13 pulls the vLLM wheel from a per-tag cu130 index (PyPI's vllm wheel
# is built against CUDA 12 and won't load on cu130). uv's default per-package
# first-match strategy would still pick the PyPI wheel, so allow it to consult
# every configured index when resolving.
if [ "x${BUILD_PROFILE}" == "xcublas13" ]; then
    EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi
# JetPack 7 / L4T arm64 wheels (torch, vllm, flash-attn) live on
# pypi.jetson-ai-lab.io and are built for cp312, so bump the venv Python
# accordingly. JetPack 6 keeps cp310 + USE_PIP=true. unsafe-best-match

View File

@@ -1,2 +1,7 @@
--extra-index-url https://download.pytorch.org/whl/cu130
vllm
# vLLM's PyPI wheel is built against CUDA 12 (libcudart.so.12) and won't load
# on a cu130 host. Pull the cu130-flavoured wheel from vLLM's per-tag index
# instead — the cublas13 case in install.sh adds --index-strategy=unsafe-best-match
# so uv consults this index alongside PyPI.
--extra-index-url https://wheels.vllm.ai/0.20.0/cu130
vllm==0.20.0

View File

@@ -168,6 +168,58 @@ class TestBackendServicer(unittest.TestCase):
self.assertEqual(opts["key_with_colons"], "a:b:c")
self.assertNotIn("invalid_no_colon", opts)
def test_apply_engine_args_known_keys(self):
"""
Tests _apply_engine_args overlays user-supplied JSON onto AsyncEngineArgs.
"""
import sys, os, json as _json
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from backend import BackendServicer
from vllm.engine.arg_utils import AsyncEngineArgs
servicer = BackendServicer()
base = AsyncEngineArgs(model="facebook/opt-125m")
extras = _json.dumps({
"trust_remote_code": True,
"max_num_seqs": 32,
})
out = servicer._apply_engine_args(base, extras)
self.assertTrue(out.trust_remote_code)
self.assertEqual(out.max_num_seqs, 32)
# untouched fields preserved
self.assertEqual(out.model, "facebook/opt-125m")
def test_apply_engine_args_unknown_key_raises(self):
"""
Tests _apply_engine_args rejects unknown keys with a helpful suggestion.
"""
import sys, os, json as _json
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from backend import BackendServicer
from vllm.engine.arg_utils import AsyncEngineArgs
servicer = BackendServicer()
base = AsyncEngineArgs(model="facebook/opt-125m")
with self.assertRaises(ValueError) as ctx:
servicer._apply_engine_args(base, _json.dumps({"trustremotecode": True}))
self.assertIn("trustremotecode", str(ctx.exception))
# close-match hint for the typo
self.assertIn("trust_remote_code", str(ctx.exception))
def test_apply_engine_args_empty_passthrough(self):
"""
Tests that empty engine_args returns the base unchanged.
"""
import sys, os
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from backend import BackendServicer
from vllm.engine.arg_utils import AsyncEngineArgs
servicer = BackendServicer()
base = AsyncEngineArgs(model="facebook/opt-125m")
self.assertIs(servicer._apply_engine_args(base, ""), base)
self.assertIs(servicer._apply_engine_args(base, None), base)
def test_tokenize_string(self):
"""
Tests the TokenizeString RPC returns valid tokens.