LocalAI/core/config/hooks_vllm.go
Richard Palethorpe 4916f8c880 feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map (#9563)
* feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map

LocalAI's vLLM backend wraps a small typed subset of vLLM's
AsyncEngineArgs (quantization, tensor_parallel_size, dtype, etc.).
Anything outside that subset -- pipeline/data/expert parallelism,
speculative_config, kv_transfer_config, all2all_backend, prefix
caching, chunked prefill, etc. -- requires a new protobuf field, a
Go struct field, an options.go line, and a backend.py mapping per
feature. That cadence is the bottleneck on shipping vLLM's
production feature set.

Add a generic `engine_args:` map on the model YAML that is
JSON-serialised into a new ModelOptions.EngineArgs proto field and
applied verbatim to AsyncEngineArgs at LoadModel time. Validation
is done by the Python backend via dataclasses.fields(); unknown
keys fail with the closest valid name as a hint.
dataclasses.replace() is used so vLLM's __post_init__ re-runs and
auto-converts dict values into nested config dataclasses
(CompilationConfig, AttentionConfig, ...). speculative_config and
kv_transfer_config flow through as dicts; vLLM converts them at
engine init.

Operators can now write:

  engine_args:
    data_parallel_size: 8
    enable_expert_parallel: true
    all2all_backend: deepep_low_latency
    speculative_config:
      method: deepseek_mtp
      num_speculative_tokens: 3
    kv_cache_dtype: fp8

without further proto/Go/Python plumbing per field.

Production defaults seeded by hooks_vllm.go: enable_prefix_caching
and enable_chunked_prefill default to true unless explicitly set.

Existing typed YAML fields (gpu_memory_utilization,
tensor_parallel_size, etc.) remain for back-compat; engine_args
overrides them when both are set.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* chore(vllm): pin cublas13 to vLLM 0.20.0 cu130 wheel

vLLM's PyPI wheel is built against CUDA 12 (libcudart.so.12) and won't
load on a cu130 host. Switch the cublas13 build to vLLM's per-tag cu130
simple-index (https://wheels.vllm.ai/0.20.0/cu130/) and pin
vllm==0.20.0. The cu130-flavoured wheel ships libcudart.so.13 and
includes the DFlash speculative-decoding method that landed in 0.20.0.

cublas13 install gets --index-strategy=unsafe-best-match so uv consults
both the cu130 index and PyPI when resolving — PyPI also publishes
vllm==0.20.0, but with cu12 binaries that error at import time.
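The resulting install command might look roughly like this (the exact invocation and Makefile wiring are assumptions; only the index URL, pin, and index strategy come from the change itself):

```shell
# Sketch of the cublas13 vLLM install; exact target/wiring is an assumption.
# unsafe-best-match lets uv consult both the cu130 simple-index and PyPI when
# resolving, and pick the cu130-flavoured wheel over PyPI's cu12 build.
uv pip install \
  --index-strategy unsafe-best-match \
  --extra-index-url https://wheels.vllm.ai/0.20.0/cu130/ \
  vllm==0.20.0
```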

Verified: Qwen3.5-4B + z-lab/Qwen3.5-4B-DFlash loads and serves chat
completions on RTX 5070 Ti (sm_120, cu130).

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* ci(vllm): bot job to bump cublas13 vLLM wheel pin

vLLM's cu130 wheel index URL is itself version-locked
(wheels.vllm.ai/<TAG>/cu130/, no /latest/ alias upstream), so a vLLM
bump means rewriting two values atomically — the URL segment and the
version constraint. bump_deps.sh handles git-sha-in-Makefile only;
add a sibling bump_vllm_wheel.sh and a matching workflow job that
mirrors the existing matrix's PR-creation pattern.

The bumper queries /releases/latest (which excludes prereleases),
strips the leading 'v', and seds both lines unconditionally. When the
file is already on the latest tag the rewrite is a no-op and
peter-evans/create-pull-request opens no PR.
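A minimal sketch of such a bumper (the function names, sed patterns, and file layout here are assumptions for illustration, not the actual bump_vllm_wheel.sh):

```shell
#!/bin/sh
# Hypothetical sketch of bump_vllm_wheel.sh; names and patterns are assumptions.
set -eu

# /releases/latest excludes prereleases; tag_name looks like "v0.20.0".
latest_tag() {
	curl -fsSL https://api.github.com/repos/vllm-project/vllm/releases/latest |
		sed -n 's/.*"tag_name": *"\(v[^"]*\)".*/\1/p'
}

# Strip the leading "v" for use in both the index URL and the version pin.
strip_v() { printf '%s' "${1#v}"; }

# Rewrite both version-locked values unconditionally; when the file is already
# on the latest tag the rewrite is a no-op and no PR gets opened.
bump_file() {
	version=$1
	file=$2
	sed -i \
		-e "s|wheels\.vllm\.ai/[^/]*/cu130/|wheels.vllm.ai/${version}/cu130/|g" \
		-e "s|vllm==[0-9][0-9.a-z]*|vllm==${version}|g" \
		"$file"
}
```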

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* docs(vllm): document engine_args and speculative decoding

The new engine_args: map plumbs arbitrary AsyncEngineArgs through to
vLLM, but the public docs only covered the basic typed fields. Add a
short subsection in the vLLM section explaining the typed/generic
split and showing a worked DFlash speculative-decoding config, with
pointers to vLLM's SpeculativeConfig reference and z-lab's drafter
collection.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-04-29 00:49:28 +02:00


package config

import (
	_ "embed"
	"encoding/json"
	"strings"

	"github.com/mudler/xlog"
)

//go:embed parser_defaults.json
var parserDefaultsJSON []byte

type parserDefaultsData struct {
	Families map[string]map[string]string `json:"families"`
	Patterns []string                     `json:"patterns"`
}

var parsersData *parserDefaultsData

func init() {
	parsersData = &parserDefaultsData{}
	if err := json.Unmarshal(parserDefaultsJSON, parsersData); err != nil {
		xlog.Warn("failed to parse parser_defaults.json", "error", err)
	}

	RegisterBackendHook("vllm", vllmDefaults)
	RegisterBackendHook("vllm-omni", vllmDefaults)
}

// MatchParserDefaults returns parser defaults for the best-matching model family.
// Returns nil if no family matches. Used both at load time (via hook) and at import time.
func MatchParserDefaults(modelID string) map[string]string {
	if parsersData == nil || len(parsersData.Patterns) == 0 {
		return nil
	}
	normalized := normalizeModelID(modelID)
	for _, pattern := range parsersData.Patterns {
		if strings.Contains(normalized, pattern) {
			if family, ok := parsersData.Families[pattern]; ok {
				return family
			}
		}
	}
	return nil
}

// productionEngineArgsDefaults are vLLM ≥ 0.6 features that production deployments
// almost always want. Applied at load time when the user hasn't set the key in
// engine_args. Anything user-supplied wins; we never silently override.
var productionEngineArgsDefaults = map[string]any{
	"enable_prefix_caching":  true,
	"enable_chunked_prefill": true,
}

func vllmDefaults(cfg *ModelConfig, modelPath string) {
	applyEngineArgDefaults(cfg)
	applyParserDefaults(cfg)
}

// applyEngineArgDefaults seeds production-friendly engine_args without overwriting
// anything the user already set.
func applyEngineArgDefaults(cfg *ModelConfig) {
	if cfg.EngineArgs == nil {
		cfg.EngineArgs = map[string]any{}
	}
	for k, v := range productionEngineArgsDefaults {
		if _, set := cfg.EngineArgs[k]; set {
			continue
		}
		cfg.EngineArgs[k] = v
	}
}

func applyParserDefaults(cfg *ModelConfig) {
	hasToolParser := false
	hasReasoningParser := false
	for _, opt := range cfg.Options {
		if strings.HasPrefix(opt, "tool_parser:") {
			hasToolParser = true
		}
		if strings.HasPrefix(opt, "reasoning_parser:") {
			hasReasoningParser = true
		}
	}
	if hasToolParser && hasReasoningParser {
		return
	}

	parsers := MatchParserDefaults(cfg.Model)
	if parsers == nil {
		parsers = MatchParserDefaults(cfg.Name)
	}
	if parsers == nil {
		return
	}

	if !hasToolParser {
		if tp, ok := parsers["tool_parser"]; ok {
			cfg.Options = append(cfg.Options, "tool_parser:"+tp)
			xlog.Debug("[parser_defaults] auto-set tool_parser", "parser", tp, "model", cfg.Model)
		}
	}
	if !hasReasoningParser {
		if rp, ok := parsers["reasoning_parser"]; ok {
			cfg.Options = append(cfg.Options, "reasoning_parser:"+rp)
			xlog.Debug("[parser_defaults] auto-set reasoning_parser", "parser", rp, "model", cfg.Model)
		}
	}
}
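As a standalone illustration, the merge rule in applyEngineArgDefaults can be sketched like this (the minimal config struct below is an assumption for the example, not LocalAI's actual ModelConfig):

```go
// Standalone sketch of the applyEngineArgDefaults merge rule: defaults only
// fill keys the user left unset; anything user-supplied always wins.
package main

import "fmt"

var defaults = map[string]any{
	"enable_prefix_caching":  true,
	"enable_chunked_prefill": true,
}

// miniConfig stands in for ModelConfig; it only carries the engine_args map.
type miniConfig struct {
	EngineArgs map[string]any
}

func seedDefaults(cfg *miniConfig) {
	if cfg.EngineArgs == nil {
		cfg.EngineArgs = map[string]any{}
	}
	for k, v := range defaults {
		if _, set := cfg.EngineArgs[k]; set {
			continue // never silently override a user-supplied value
		}
		cfg.EngineArgs[k] = v
	}
}

func main() {
	// Operator explicitly disabled prefix caching in the model YAML.
	cfg := &miniConfig{EngineArgs: map[string]any{"enable_prefix_caching": false}}
	seedDefaults(cfg)
	fmt.Println(cfg.EngineArgs["enable_prefix_caching"])  // false: user wins
	fmt.Println(cfg.EngineArgs["enable_chunked_prefill"]) // true: seeded default
}
```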