feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map (#9563)
* feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map
LocalAI's vLLM backend wraps a small typed subset of vLLM's
AsyncEngineArgs (quantization, tensor_parallel_size, dtype, etc.).
Anything outside that subset -- pipeline/data/expert parallelism,
speculative_config, kv_transfer_config, all2all_backend, prefix
caching, chunked prefill, etc. -- requires a new protobuf field, a
Go struct field, an options.go line, and a backend.py mapping per
feature. That cadence is the bottleneck on shipping vLLM's
production feature set.
Add a generic `engine_args:` map on the model YAML that is
JSON-serialised into a new ModelOptions.EngineArgs proto field and
applied verbatim to AsyncEngineArgs at LoadModel time. Validation
is done by the Python backend via dataclasses.fields(); unknown
keys fail with the closest valid name as a hint.
dataclasses.replace() is used so vLLM's __post_init__ re-runs and
auto-converts dict values into nested config dataclasses
(CompilationConfig, AttentionConfig, ...). speculative_config and
kv_transfer_config flow through as dicts; vLLM converts them at
engine init.
Operators can now write:

  engine_args:
    data_parallel_size: 8
    enable_expert_parallel: true
    all2all_backend: deepep_low_latency
    speculative_config:
      method: deepseek_mtp
      num_speculative_tokens: 3
    kv_cache_dtype: fp8

without further proto/Go/Python plumbing per field.
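The stdlib behaviour this relies on (dataclasses.replace() constructs a new instance, so __post_init__ runs again and converts dict values) can be sketched with a toy dataclass standing in for AsyncEngineArgs; the class and field names below are illustrative, not vLLM's:

```python
import dataclasses

@dataclasses.dataclass
class ToyCompilationConfig:      # stand-in for a nested vLLM config dataclass
    level: int = 0

@dataclasses.dataclass
class ToyEngineArgs:             # stand-in for AsyncEngineArgs
    model: str = ""
    compilation_config: object = None

    def __post_init__(self):
        # mirrors the dict -> dataclass auto-conversion described above
        if isinstance(self.compilation_config, dict):
            self.compilation_config = ToyCompilationConfig(**self.compilation_config)

base = ToyEngineArgs(model="facebook/opt-125m")
# replace() builds a fresh instance, so the dict override is converted again
out = dataclasses.replace(base, compilation_config={"level": 3})
assert isinstance(out.compilation_config, ToyCompilationConfig)
assert out.compilation_config.level == 3
```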
Production defaults seeded by hooks_vllm.go: enable_prefix_caching
and enable_chunked_prefill default to true unless explicitly set.
Existing typed YAML fields (gpu_memory_utilization,
tensor_parallel_size, etc.) remain for back-compat; engine_args
overrides them when both are set.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* chore(vllm): pin cublas13 to vLLM 0.20.0 cu130 wheel
vLLM's PyPI wheel is built against CUDA 12 (libcudart.so.12) and won't
load on a cu130 host. Switch the cublas13 build to vLLM's per-tag cu130
simple-index (https://wheels.vllm.ai/0.20.0/cu130/) and pin
vllm==0.20.0. The cu130-flavoured wheel ships libcudart.so.13 and
includes the DFlash speculative-decoding method that landed in 0.20.0.
cublas13 install gets --index-strategy=unsafe-best-match so uv consults
both the cu130 index and PyPI when resolving — PyPI also publishes
vllm==0.20.0, but with cu12 binaries that error at import time.
Verified: Qwen3.5-4B + z-lab/Qwen3.5-4B-DFlash loads and serves chat
completions on RTX 5070 Ti (sm_120, cu130).
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* ci(vllm): bot job to bump cublas13 vLLM wheel pin
vLLM's cu130 wheel index URL is itself version-locked
(wheels.vllm.ai/<TAG>/cu130/, no /latest/ alias upstream), so a vLLM
bump means rewriting two values atomically — the URL segment and the
version constraint. bump_deps.sh handles git-sha-in-Makefile only;
add a sibling bump_vllm_wheel.sh and a matching workflow job that
mirrors the existing matrix's PR-creation pattern.
The bumper queries /releases/latest (which excludes prereleases),
strips the leading 'v', and seds both lines unconditionally. When the
file is already on the latest tag the rewrite is a no-op and
peter-evans/create-pull-request opens no PR.
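The tag-selection step can be sketched standalone (the shipped script uses curl plus a python3 one-liner; this is an equivalent illustration, not the script itself):

```python
import json
import urllib.request

def latest_vllm_version(repo: str = "vllm-project/vllm") -> str:
    # /releases/latest returns only the newest non-prerelease release
    url = f"https://api.github.com/repos/{repo}/releases/latest"
    req = urllib.request.Request(url, headers={"Accept": "application/vnd.github+json"})
    with urllib.request.urlopen(req) as resp:
        tag = json.load(resp)["tag_name"]    # e.g. "v0.20.0"
    # the wheel-index URL segment and the pip pin both drop the leading 'v'
    return tag.removeprefix("v")             # "0.20.0"

if __name__ == "__main__":
    print(latest_vllm_version())
```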
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* docs(vllm): document engine_args and speculative decoding
The new engine_args: map plumbs arbitrary AsyncEngineArgs through to
vLLM, but the public docs only covered the basic typed fields. Add a
short subsection in the vLLM section explaining the typed/generic
split and showing a worked DFlash speculative-decoding config, with
pointers to vLLM's SpeculativeConfig reference and z-lab's drafter
collection.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
committed by GitHub
parent 55afda22e3 · commit 4916f8c880
.github/bump_vllm_wheel.sh (new executable file, 45 lines, vendored)
@@ -0,0 +1,45 @@
#!/bin/bash
# Bump the cublas13 vLLM wheel pin in requirements-cublas13-after.txt.
#
# vLLM's PyPI wheel is built against CUDA 12 so the cublas13 build pulls a
# cu130-flavoured wheel from vLLM's per-tag index at
# https://wheels.vllm.ai/<TAG>/cu130/. That URL segment is itself version-locked
# (no /latest/ alias upstream), so bumping vLLM means rewriting both the URL
# segment and the version constraint atomically. bump_deps.sh handles git-sha
# vars in Makefiles; this script handles the two-value rewrite specific to the
# vLLM requirements file.
set -xe

REPO=$1 # vllm-project/vllm
FILE=$2 # backend/python/vllm/requirements-cublas13-after.txt
VAR=$3 # VLLM_VERSION (used for output file names so the workflow can read them)

if [ -z "$FILE" ] || [ -z "$REPO" ] || [ -z "$VAR" ]; then
    echo "usage: $0 <repo> <requirements-file> <var-name>" >&2
    exit 1
fi

# /releases/latest returns the most recent non-prerelease tag.
LATEST_TAG=$(curl -sS -H "Accept: application/vnd.github+json" \
    "https://api.github.com/repos/$REPO/releases/latest" \
    | python3 -c "import json,sys; print(json.load(sys.stdin)['tag_name'])")

# Strip leading 'v' (vLLM tags are 'v0.20.0', the URL/version use '0.20.0').
NEW_VERSION="${LATEST_TAG#v}"

set +e
CURRENT_VERSION=$(grep -oE '^vllm==[0-9]+\.[0-9]+\.[0-9]+' "$FILE" | head -1 | cut -d= -f3)
set -e

# sed both lines unconditionally — peter-evans/create-pull-request opens no PR
# when the working tree is clean, so a no-op rewrite is safe.
sed -i "$FILE" \
    -e "s|wheels\.vllm\.ai/[^/]*/cu130|wheels.vllm.ai/$NEW_VERSION/cu130|g" \
    -e "s|^vllm==.*|vllm==$NEW_VERSION|"

if [ -z "$CURRENT_VERSION" ]; then
    echo "Could not find vllm==X.Y.Z in $FILE."
    exit 0
fi

echo "Changes: https://github.com/$REPO/compare/v${CURRENT_VERSION}...${LATEST_TAG}" >> "${VAR}_message.txt"
echo "${NEW_VERSION}" >> "${VAR}_commit.txt"
.github/workflows/bump_deps.yaml (vendored, 36 lines changed)
@@ -80,5 +80,37 @@ jobs:
          body: ${{ steps.bump.outputs.message }}
          signoff: true

  bump-vllm-wheel:
    # vLLM's cu130 wheel comes from a per-tag index URL (no /latest/ alias),
    # so the cublas13 requirements file pins both a URL segment and a version
    # constraint. bump_deps.sh handles git-sha-in-Makefile only — this job
    # rewrites both values atomically when a new vLLM stable tag ships.
    if: github.repository == 'mudler/LocalAI'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - name: Bump vLLM cu130 wheel pin 🔧
        id: bump
        run: |
          bash .github/bump_vllm_wheel.sh vllm-project/vllm backend/python/vllm/requirements-cublas13-after.txt VLLM_VERSION
          {
            echo 'message<<EOF'
            cat "VLLM_VERSION_message.txt"
            echo EOF
          } >> "$GITHUB_OUTPUT"
          {
            echo 'commit<<EOF'
            cat "VLLM_VERSION_commit.txt"
            echo EOF
          } >> "$GITHUB_OUTPUT"
          rm -rfv VLLM_VERSION_message.txt VLLM_VERSION_commit.txt
      - name: Create Pull Request
        uses: peter-evans/create-pull-request@v8
        with:
          token: ${{ secrets.UPDATE_BOT_TOKEN }}
          push-to-fork: ci-forks/LocalAI
          commit-message: ':arrow_up: Update vllm-project/vllm cu130 wheel'
          title: 'chore: :arrow_up: Update vllm-project/vllm cu130 wheel to `${{ steps.bump.outputs.commit }}`'
          branch: "update/VLLM_VERSION"
          body: ${{ steps.bump.outputs.message }}
          signoff: true
@@ -310,6 +310,11 @@ message ModelOptions {
  bool Reranking = 71;

  repeated string Overrides = 72;

  // EngineArgs carries a JSON-encoded map of backend-native engine arguments
  // applied verbatim to the backend's engine constructor (e.g. vLLM AsyncEngineArgs).
  // Unknown keys produce an error at LoadModel time.
  string EngineArgs = 73;
}

message Result {
@@ -1,5 +1,7 @@
#!/usr/bin/env python3
import asyncio
import dataclasses
import difflib
from concurrent import futures
import argparse
import signal
@@ -101,6 +103,36 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
                opts[key.strip()] = value.strip()
        return opts

    def _apply_engine_args(self, engine_args, engine_args_json):
        """Apply user-supplied engine_args (JSON object) onto an AsyncEngineArgs.

        Returns a new AsyncEngineArgs with the typed fields preserved and the
        user's overrides layered on top. Uses ``dataclasses.replace`` so vLLM's
        ``__post_init__`` re-runs and auto-converts dict-valued fields like
        ``compilation_config`` / ``attention_config`` into their dataclass form.
        ``speculative_config`` and ``kv_transfer_config`` are accepted as dicts
        directly (vLLM converts them at engine init).

        Unknown keys raise ValueError with the closest valid field as a hint.
        """
        if not engine_args_json:
            return engine_args
        try:
            extra = json.loads(engine_args_json)
        except json.JSONDecodeError as e:
            raise ValueError(f"engine_args is not valid JSON: {e}") from e
        if not isinstance(extra, dict):
            raise ValueError(
                f"engine_args must be a JSON object, got {type(extra).__name__}"
            )
        valid = {f.name for f in dataclasses.fields(type(engine_args))}
        for key in extra:
            if key not in valid:
                suggestion = difflib.get_close_matches(key, valid, n=1)
                hint = f" did you mean {suggestion[0]!r}?" if suggestion else ""
                raise ValueError(f"unknown engine_args key {key!r}.{hint}")
        return dataclasses.replace(engine_args, **extra)

    def _messages_to_dicts(self, messages):
        """Convert proto Messages to list of dicts suitable for apply_chat_template()."""
        result = []
@@ -176,6 +208,15 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
                "audio": max(request.LimitAudioPerPrompt, 1)
            }

        # engine_args from YAML overrides typed fields above so operators can
        # tune anything the AsyncEngineArgs dataclass exposes without waiting
        # on protobuf changes.
        try:
            engine_args = self._apply_engine_args(engine_args, request.EngineArgs)
        except ValueError as err:
            print(f"engine_args error: {err}", file=sys.stderr)
            return backend_pb2.Result(success=False, message=str(err))

        try:
            self.llm = AsyncLLMEngine.from_engine_args(engine_args)
        except Exception as err:
@@ -32,6 +32,14 @@ if [ "x${BUILD_PROFILE}" == "xcpu" ]; then
    EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi

# cublas13 pulls the vLLM wheel from a per-tag cu130 index (PyPI's vllm wheel
# is built against CUDA 12 and won't load on cu130). uv's default per-package
# first-match strategy would still pick the PyPI wheel, so allow it to consult
# every configured index when resolving.
if [ "x${BUILD_PROFILE}" == "xcublas13" ]; then
    EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi

# JetPack 7 / L4T arm64 wheels (torch, vllm, flash-attn) live on
# pypi.jetson-ai-lab.io and are built for cp312, so bump the venv Python
# accordingly. JetPack 6 keeps cp310 + USE_PIP=true. unsafe-best-match
@@ -1,2 +1,7 @@
--extra-index-url https://download.pytorch.org/whl/cu130
vllm
# vLLM's PyPI wheel is built against CUDA 12 (libcudart.so.12) and won't load
# on a cu130 host. Pull the cu130-flavoured wheel from vLLM's per-tag index
# instead — the cublas13 case in install.sh adds --index-strategy=unsafe-best-match
# so uv consults this index alongside PyPI.
--extra-index-url https://wheels.vllm.ai/0.20.0/cu130
vllm==0.20.0
@@ -168,6 +168,58 @@ class TestBackendServicer(unittest.TestCase):
        self.assertEqual(opts["key_with_colons"], "a:b:c")
        self.assertNotIn("invalid_no_colon", opts)

    def test_apply_engine_args_known_keys(self):
        """
        Tests _apply_engine_args overlays user-supplied JSON onto AsyncEngineArgs.
        """
        import sys, os, json as _json
        sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
        from backend import BackendServicer
        from vllm.engine.arg_utils import AsyncEngineArgs

        servicer = BackendServicer()
        base = AsyncEngineArgs(model="facebook/opt-125m")
        extras = _json.dumps({
            "trust_remote_code": True,
            "max_num_seqs": 32,
        })
        out = servicer._apply_engine_args(base, extras)
        self.assertTrue(out.trust_remote_code)
        self.assertEqual(out.max_num_seqs, 32)
        # untouched fields preserved
        self.assertEqual(out.model, "facebook/opt-125m")

    def test_apply_engine_args_unknown_key_raises(self):
        """
        Tests _apply_engine_args rejects unknown keys with a helpful suggestion.
        """
        import sys, os, json as _json
        sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
        from backend import BackendServicer
        from vllm.engine.arg_utils import AsyncEngineArgs

        servicer = BackendServicer()
        base = AsyncEngineArgs(model="facebook/opt-125m")
        with self.assertRaises(ValueError) as ctx:
            servicer._apply_engine_args(base, _json.dumps({"trustremotecode": True}))
        self.assertIn("trustremotecode", str(ctx.exception))
        # close-match hint for the typo
        self.assertIn("trust_remote_code", str(ctx.exception))

    def test_apply_engine_args_empty_passthrough(self):
        """
        Tests that empty engine_args returns the base unchanged.
        """
        import sys, os
        sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
        from backend import BackendServicer
        from vllm.engine.arg_utils import AsyncEngineArgs

        servicer = BackendServicer()
        base = AsyncEngineArgs(model="facebook/opt-125m")
        self.assertIs(servicer._apply_engine_args(base, ""), base)
        self.assertIs(servicer._apply_engine_args(base, None), base)

    def test_tokenize_string(self):
        """
        Tests the TokenizeString RPC returns valid tokens.
@@ -1,6 +1,8 @@
package backend

import (
	"encoding/json"
	"fmt"
	"math/rand/v2"
	"os"
	"path/filepath"
@@ -159,6 +161,19 @@ func grpcModelOpts(c config.ModelConfig, modelPath string) *pb.ModelOptions {
		})
	}

	engineArgsJSON := ""
	if len(c.EngineArgs) > 0 {
		buf, err := json.Marshal(c.EngineArgs)
		if err != nil {
			// ModelConfig.Validate() rejects unmarshalable engine_args at
			// config load, so reaching here means the validator was bypassed.
			// Silently dropping user-set options would change runtime behaviour
			// without warning — fail loud instead.
			panic(fmt.Sprintf("engine_args marshal failed for model %q: %v (Validate() should have caught this)", c.Model, err))
		}
		engineArgsJSON = string(buf)
	}

	opts := &pb.ModelOptions{
		CUDA:          c.CUDA || c.Diffusers.CUDA,
		SchedulerType: c.Diffusers.SchedulerType,
@@ -176,6 +191,7 @@ func grpcModelOpts(c config.ModelConfig, modelPath string) *pb.ModelOptions {
		CLIPSubfolder: c.Diffusers.ClipSubFolder,
		Options:       c.Options,
		Overrides:     c.Overrides,
		EngineArgs:    engineArgsJSON,
		CLIPSkip:      int32(c.Diffusers.ClipSkip),
		ControlNet:    c.Diffusers.ControlNet,
		ContextSize:   int32(ctxSize),
core/backend/options_internal_test.go (new file, 44 lines)
@@ -0,0 +1,44 @@
package backend

import (
	"encoding/json"

	"github.com/mudler/LocalAI/core/config"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

var _ = Describe("grpcModelOpts EngineArgs", func() {
	It("serialises engine_args as JSON preserving nested values", func() {
		threads := 1
		cfg := config.ModelConfig{
			Threads: &threads,
			LLMConfig: config.LLMConfig{
				EngineArgs: map[string]any{
					"data_parallel_size":     8,
					"enable_expert_parallel": true,
					"speculative_config": map[string]any{
						"method":                 "ngram",
						"num_speculative_tokens": 4,
					},
				},
			},
		}

		opts := grpcModelOpts(cfg, "/tmp/models")
		Expect(opts.EngineArgs).NotTo(BeEmpty())

		var round map[string]any
		Expect(json.Unmarshal([]byte(opts.EngineArgs), &round)).To(Succeed())
		Expect(round["data_parallel_size"]).To(BeEquivalentTo(8))
		Expect(round["enable_expert_parallel"]).To(BeTrue())
		Expect(round["speculative_config"]).To(HaveKeyWithValue("method", "ngram"))
	})

	It("leaves EngineArgs empty when unset", func() {
		threads := 1
		opts := grpcModelOpts(config.ModelConfig{Threads: &threads}, "/tmp/models")
		Expect(opts.EngineArgs).To(BeEmpty())
	})
})
@@ -110,5 +110,30 @@ var _ = Describe("Backend hooks and parser defaults", func() {
			}
			Expect(count).To(Equal(1))
		})

		It("seeds production engine_args defaults", func() {
			cfg := &ModelConfig{Backend: "vllm"}
			cfg.SetDefaults()

			Expect(cfg.EngineArgs).NotTo(BeNil())
			Expect(cfg.EngineArgs["enable_prefix_caching"]).To(Equal(true))
			Expect(cfg.EngineArgs["enable_chunked_prefill"]).To(Equal(true))
		})

		It("does not override user-set engine_args", func() {
			cfg := &ModelConfig{
				Backend: "vllm",
				LLMConfig: LLMConfig{
					EngineArgs: map[string]any{
						"enable_prefix_caching": false,
					},
				},
			}
			cfg.SetDefaults()

			Expect(cfg.EngineArgs["enable_prefix_caching"]).To(Equal(false))
			// chunked_prefill is still seeded since user didn't set it
			Expect(cfg.EngineArgs["enable_chunked_prefill"]).To(Equal(true))
		})
	})
})
@@ -45,8 +45,34 @@ func MatchParserDefaults(modelID string) map[string]string {
	return nil
}

// productionEngineArgsDefaults are vLLM ≥ 0.6 features that production deployments
// almost always want. Applied at load time when the user hasn't set the key in
// engine_args. Anything user-supplied wins; we never silently override.
var productionEngineArgsDefaults = map[string]any{
	"enable_prefix_caching":  true,
	"enable_chunked_prefill": true,
}

func vllmDefaults(cfg *ModelConfig, modelPath string) {
	// Check if user already set tool_parser or reasoning_parser in Options
	applyEngineArgDefaults(cfg)
	applyParserDefaults(cfg)
}

// applyEngineArgDefaults seeds production-friendly engine_args without overwriting
// anything the user already set.
func applyEngineArgDefaults(cfg *ModelConfig) {
	if cfg.EngineArgs == nil {
		cfg.EngineArgs = map[string]any{}
	}
	for k, v := range productionEngineArgsDefaults {
		if _, set := cfg.EngineArgs[k]; set {
			continue
		}
		cfg.EngineArgs[k] = v
	}
}

func applyParserDefaults(cfg *ModelConfig) {
	hasToolParser := false
	hasReasoningParser := false
	for _, opt := range cfg.Options {
@@ -61,7 +87,6 @@ func vllmDefaults(cfg *ModelConfig, modelPath string) {
		return
	}

	// Try matching against Model field, then Name
	parsers := MatchParserDefaults(cfg.Model)
	if parsers == nil {
		parsers = MatchParserDefaults(cfg.Name)
@@ -1,6 +1,7 @@
package config

import (
	"encoding/json"
	"fmt"
	"os"
	"regexp"
@@ -241,7 +242,13 @@ type LLMConfig struct {
	DisableLogStatus bool             `yaml:"disable_log_stats,omitempty" json:"disable_log_stats,omitempty"` // vLLM
	DType            string           `yaml:"dtype,omitempty" json:"dtype,omitempty"` // vLLM
	LimitMMPerPrompt LimitMMPerPrompt `yaml:"limit_mm_per_prompt,omitempty" json:"limit_mm_per_prompt,omitempty"` // vLLM
	MMProj           string           `yaml:"mmproj,omitempty" json:"mmproj,omitempty"`
	// EngineArgs is a backend-native passthrough applied to the engine constructor
	// (e.g. vLLM AsyncEngineArgs). Values may be primitives or nested maps; nested
	// maps materialise into the backend's nested config dataclasses (e.g.
	// SpeculativeConfig, KVTransferConfig, CompilationConfig). Unknown keys cause
	// the backend to fail LoadModel with a list of valid names.
	EngineArgs map[string]any `yaml:"engine_args,omitempty" json:"engine_args,omitempty"`
	MMProj     string         `yaml:"mmproj,omitempty" json:"mmproj,omitempty"`

	FlashAttention *string `yaml:"flash_attention,omitempty" json:"flash_attention,omitempty"`
	NoKVOffloading bool    `yaml:"no_kv_offloading,omitempty" json:"no_kv_offloading,omitempty"`
@@ -545,6 +552,15 @@ func (c *ModelConfig) Validate() (bool, error) {
		}
	}

	// engine_args crosses the gRPC boundary as a JSON-encoded string. Reject
	// unmarshalable values here so a config that would silently lose user-set
	// options at load time is rejected at parse time instead.
	if len(c.EngineArgs) > 0 {
		if _, err := json.Marshal(c.EngineArgs); err != nil {
			return false, fmt.Errorf("engine_args is not JSON-serialisable: %w", err)
		}
	}

	return true, nil
}
@@ -230,4 +230,38 @@ mcp:
		Expect(err).To(BeNil())
		Expect(valid).To(BeTrue())
	})
	It("Test Validate rejects unmarshalable engine_args", func() {
		// chan values cannot be JSON-marshalled. A valid YAML config could
		// not produce one, but a Go caller stuffing a bad value would, and
		// silently dropping it would change runtime behaviour.
		cfg := &ModelConfig{
			Backend: "vllm",
			LLMConfig: LLMConfig{
				EngineArgs: map[string]any{
					"speculative_config": make(chan int),
				},
			},
		}
		valid, err := cfg.Validate()
		Expect(valid).To(BeFalse())
		Expect(err).ToNot(BeNil())
		Expect(err.Error()).To(ContainSubstring("engine_args is not JSON-serialisable"))
	})
	It("Test Validate accepts well-formed engine_args", func() {
		cfg := &ModelConfig{
			Backend: "vllm",
			LLMConfig: LLMConfig{
				EngineArgs: map[string]any{
					"data_parallel_size": 8,
					"speculative_config": map[string]any{
						"method":                 "ngram",
						"num_speculative_tokens": 4,
					},
				},
			},
		}
		valid, err := cfg.Validate()
		Expect(err).To(BeNil())
		Expect(valid).To(BeTrue())
	})
})
@@ -665,6 +665,46 @@ curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d
}'
```

#### Passing arbitrary vLLM options with `engine_args`

A subset of `AsyncEngineArgs` is exposed as typed YAML fields
(`tensor_parallel_size`, `gpu_memory_utilization`, `quantization`,
`max_model_len`, `dtype`, `trust_remote_code`, `enforce_eager`, …).
Anything else can be passed through the generic `engine_args:` map.
Keys are forwarded verbatim to vLLM's engine; unknown keys fail at load
time with the closest valid name as a hint. Nested maps materialise
into vLLM's nested config dataclasses (`SpeculativeConfig`,
`KVTransferConfig`, `CompilationConfig`, …).

Speculative decoding (DFlash, ngram, eagle, deepseek_mtp, …) is
configured this way:

```yaml
name: qwen3.5-4b-dflash
backend: vllm
parameters:
  model: Qwen/Qwen3.5-4B
context_size: 8192
max_model_len: 8192
trust_remote_code: true
quantization: fp8
template:
  use_tokenizer_template: true
engine_args:
  speculative_config:
    method: dflash
    model: z-lab/Qwen3.5-4B-DFlash
    num_speculative_tokens: 15
```

The shape of `speculative_config` follows vLLM's
[`SpeculativeConfig`](https://docs.vllm.ai/en/latest/api/vllm/config/speculative.html)
— `method` picks the algorithm, the remaining keys are method-specific.
Drafters from [z-lab](https://huggingface.co/z-lab) are paired with
specific target models; pick the one that matches your target. The
drafter loads in its native precision regardless of the target's
`quantization:` setting.

### Transformers

[Transformers](https://huggingface.co/docs/transformers/index) is a State-of-the-art Machine Learning library for PyTorch, TensorFlow, and JAX.