feat(sglang): wire engine_args, add cuda13 build, ship MTP gallery demos (#9686)

Bring the sglang Python backend up to feature parity with vllm by adding
the same engine_args: map plumbing the vLLM backend already has. Any
ServerArgs field (~380 in sglang 0.5.11) becomes settable from a model
YAML, including the speculative-decoding flags needed for Multi-Token
Prediction. Validation matches the vllm backend's: keys are checked
against dataclasses.fields(ServerArgs), unknown keys raise ValueError
with a difflib close-match suggestion at LoadModel time, and the typed
ModelOptions fields keep their existing meaning with engine_args
overriding them.
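The validation pattern described above can be sketched standalone. This is an illustration only, using a toy stand-in dataclass so it runs without sglang installed; the real backend validates against `dataclasses.fields(ServerArgs)` from `sglang.srt.server_args`:

```python
import dataclasses
import difflib

# Toy stand-in for sglang's ServerArgs — illustration only. The real
# backend imports sglang.srt.server_args.ServerArgs (~380 fields).
@dataclasses.dataclass
class ToyServerArgs:
    model_path: str = ""
    trust_remote_code: bool = False
    mem_fraction_static: float = 0.85

def validate_engine_args(extra: dict) -> None:
    # Collect the set of legal keys from the dataclass definition.
    valid = {f.name for f in dataclasses.fields(ToyServerArgs)}
    for key in extra:
        if key not in valid:
            # Offer the closest legal key as a hint, if one is close enough.
            suggestion = difflib.get_close_matches(key, valid, n=1)
            hint = f" did you mean {suggestion[0]!r}?" if suggestion else ""
            raise ValueError(f"unknown engine_args key {key!r}.{hint}")

validate_engine_args({"trust_remote_code": True})  # passes silently
try:
    validate_engine_args({"trust_remotecode": True})
except ValueError as e:
    print(e)
    # → unknown engine_args key 'trust_remotecode'. did you mean 'trust_remote_code'?
```

The point of failing at validation time rather than forwarding unknown keys is that a typo surfaces as a one-line ValueError at LoadModel instead of a TypeError deep inside engine startup.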

Backend code:
* backend/python/sglang/backend.py: add _apply_engine_args, import
  dataclasses/difflib/ServerArgs, call from LoadModel; rename Seed ->
  sampling_seed (sglang 0.5.11 renamed the SamplingParams field).
* backend/python/sglang/test.py + test.sh + Makefile: six unit tests
  exercising the helper directly (no engine load required).

Build / CI / backend gallery (cuda13 + l4t13 paths are now first-class):
* backend/python/sglang/install.sh: add --prerelease=allow because
  sglang 0.5.11 hard-pins flash-attn-4 which only ships beta wheels;
  add --index-strategy=unsafe-best-match for cublas12 so the cu128
  torch index wins over default-PyPI's cu130; new pyproject.toml-driven
  l4t13 install path so [tool.uv.sources] can pin torch/torchvision/
  torchaudio/sglang to the jetson-ai-lab index without forcing every
  transitive PyPI dep through the L4T mirror's flaky proxy (mirrors the
  equivalent fix in backend/python/vllm/install.sh).
* backend/python/sglang/pyproject.toml (new): L4T project spec with
  explicit-source jetson-ai-lab index. Replaces requirements-l4t13.txt
  for the l4t13 BUILD_PROFILE; other profiles still go through the
  requirements-*.txt pipeline via libbackend.sh's installRequirements.
* backend/python/sglang/requirements-l4t13.txt: removed; superseded
  by pyproject.toml.
* backend/python/sglang/requirements-cublas{12,13}{,-after}.txt: pin
  sglang>=0.5.11 (Gemma 4 floor); add cu130 torch index for cublas13
  (new files) and cu128 torch index for cublas12 (default PyPI now
  ships cu130 torch wheels by default and breaks cu12 hosts).
* backend/index.yaml: add cuda13-sglang and cuda13-sglang-development
  capability mappings + image entries pointing at
  quay.io/.../-gpu-nvidia-cuda-13-sglang.
* .github/workflows/backend.yml: new cublas13 sglang matrix entry,
  mirroring vllm's cuda13 build.

Model gallery + docs:
* gallery/sglang.yaml: base sglang config template, mirrors vllm.yaml.
* gallery/sglang-gemma-4-{e2b,e4b}-mtp.yaml: Gemma 4 MTP demos
  transcribed verbatim from the SGLang Gemma 4 cookbook MTP commands.
* gallery/sglang-mimo-7b-mtp.yaml: MiMo-7B-RL with built-in MTP heads
  + online fp8 weight quantization, verified end-to-end on a 16 GB
  RTX 5070 Ti at ~88 tok/s. Uses mem_fraction_static: 0.7 because the
  MTP draft worker's vocab embedding is loaded unquantised and OOMs
  the static reservation at sglang's 0.85 default.
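To make the mem_fraction_static gotcha concrete, here is a back-of-envelope sketch. The vocab/hidden/dtype numbers are purely illustrative assumptions (the real sizes depend on the model and sglang version); only the 16 GB card and the 0.85 → 0.7 change come from the source:

```python
# Illustrative arithmetic: lowering mem_fraction_static leaves headroom
# outside sglang's static reservation for allocations the quantised
# weight budget does not cover (e.g. an unquantised draft-worker
# embedding). All model dimensions below are hypothetical.
GIB = 1024**3
total_vram = 16 * GIB                     # 16 GB-class card

static_default = 0.85 * total_vram        # sglang's default reservation
static_tuned = 0.70 * total_vram          # value used in the gallery entry

# Hypothetical unquantised fp16 vocab embedding for an MTP draft worker:
vocab, hidden, bytes_fp16 = 150_000, 4096, 2
draft_embedding = vocab * hidden * bytes_fp16     # ≈ 1.1 GiB extra

headroom_default = total_vram - static_default    # ≈ 2.4 GiB for everything else
headroom_tuned = total_vram - static_tuned        # ≈ 4.8 GiB

print(round(headroom_default / GIB, 1), round(headroom_tuned / GIB, 1))
# → 2.4 4.8
```

With the default 0.85 reservation, the extra embedding plus CUDA context, activations, and allocator overhead must squeeze into roughly 2.4 GiB; dropping to 0.7 doubles that headroom.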
* gallery/index.yaml: three new entries (gemma-4-e2b-it:sglang-mtp,
  gemma-4-e4b-it:sglang-mtp, mimo-7b-mtp:sglang).
* docs/content/features/text-generation.md: new SGLang section with
  setup, engine_args reference, MTP demos, version requirements.
* .agents/sglang-backend.md (new): agent one-pager covering the flat
  ServerArgs structure, the typed-vs-engine_args precedence, the
  speculative-decoding cheatsheet, and the mem_fraction_static gotcha
  documented above.
* AGENTS.md: index entry for the new agent doc.

Known limitation: the two Gemma 4 MTP gallery entries ship a recipe
that doesn't yet run on stock libraries. The drafter checkpoints
(google/gemma-4-{E2B,E4B}-it-assistant) declare
model_type: gemma4_assistant / Gemma4AssistantForCausalLM, which
neither transformers (<=5.6.0, including the SGLang cookbook's pinned
commit 91b1ab1f... and main HEAD) nor sglang's own model registry
(<=0.5.11) registers as of 2026-05-06. They will start working when
HF or sglang upstream registers the architecture -- no LocalAI
changes needed. The MiMo MTP demo and the non-MTP Gemma 4 paths work
today on this build (verified on RTX 5070 Ti, 16 GB).

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash] [WebFetch] [WebSearch]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
Author: Richard Palethorpe
Committed: 2026-05-07 16:27:29 +01:00 (via GitHub)
Parent: 048daa0cdc
Commit: c894d9c826
21 changed files with 732 additions and 21 deletions


@@ -8,6 +8,12 @@ run: sglang
	bash run.sh
	@echo "sglang run."

.PHONY: test
test: sglang
	@echo "Testing sglang..."
	bash test.sh
	@echo "sglang tested."

.PHONY: protogen-clean
protogen-clean:
	$(RM) backend_pb2_grpc.py backend_pb2.py


@@ -9,10 +9,18 @@ The streaming path applies sglang's per-request FunctionCallParser and
ReasoningParser so tool_calls and reasoning_content are emitted
incrementally inside ChatDelta, which is a capability sglang exposes
natively and vLLM does not.
Like the vLLM backend, this one accepts an arbitrary ``engine_args:``
map in the model YAML; keys are validated against ``ServerArgs`` fields
and forwarded to ``Engine(**kwargs)``. That covers speculative decoding
(EAGLE/EAGLE3/DFLASH/NGRAM/STANDALONE plus MTP via NEXTN), attention
backend selection, MoE knobs, hierarchical cache, and so on.
"""
import asyncio
from concurrent import futures
import argparse
import dataclasses
import difflib
import signal
import sys
import os
@@ -38,6 +46,7 @@ from grpc_auth import get_auth_interceptors
# are wrapped in try/except so older / leaner installs that omit them
# still load the backend for plain text generation.
from sglang.srt.entrypoints.engine import Engine
from sglang.srt.server_args import ServerArgs
try:
    from sglang.srt.function_call.function_call_parser import FunctionCallParser
@@ -66,6 +75,19 @@ except Exception:
    HAS_TRANSFORMERS = False
# sglang 0.5.11 renamed SamplingParams.seed -> sampling_seed (PR #21952).
# Earlier 0.5.x releases (e.g. 0.5.1.post2 — the wheel still pinned by the
# pypi.jetson-ai-lab.io sbsa/cu130 mirror used by the l4t13 build profile)
# accept only `seed`. Detect the supported keyword once at import time so
# both versions work without a hard pin floor.
try:
    import inspect as _inspect
    from sglang.srt.sampling.sampling_params import SamplingParams as _SamplingParams
    _SEED_KEY = "sampling_seed" if "sampling_seed" in _inspect.signature(_SamplingParams).parameters else "seed"
except Exception:
    _SEED_KEY = "sampling_seed"
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
MAX_WORKERS = int(os.environ.get('PYTHON_GRPC_MAX_WORKERS', '1'))
@@ -82,6 +104,37 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
                opts[key.strip()] = value.strip()
        return opts

    def _apply_engine_args(self, engine_kwargs: dict, engine_args_json: str) -> dict:
        """Merge user-supplied engine_args (JSON object) into the kwargs dict
        that will be forwarded to ``sglang.Engine`` (which constructs a
        ``ServerArgs`` from them).

        Mirrors ``backend/python/vllm/backend.py::_apply_engine_args`` but
        operates on the kwargs dict because sglang's ``Engine.__init__``
        accepts ``**kwargs`` directly rather than a pre-built dataclass.
        Validation happens against ``ServerArgs`` fields so a typo fails
        early with a close-match suggestion instead of producing a confusing
        ``TypeError`` deep inside engine startup.
        """
        if not engine_args_json:
            return engine_kwargs
        try:
            extra = json.loads(engine_args_json)
        except json.JSONDecodeError as e:
            raise ValueError(f"engine_args is not valid JSON: {e}") from e
        if not isinstance(extra, dict):
            raise ValueError(
                f"engine_args must be a JSON object, got {type(extra).__name__}"
            )
        valid = {f.name for f in dataclasses.fields(ServerArgs)}
        for key in extra:
            if key not in valid:
                suggestion = difflib.get_close_matches(key, valid, n=1)
                hint = f" did you mean {suggestion[0]!r}?" if suggestion else ""
                raise ValueError(f"unknown engine_args key {key!r}.{hint}")
        engine_kwargs.update(extra)
        return engine_kwargs

    def _messages_to_dicts(self, messages) -> List[dict]:
        result: List[dict] = []
        for msg in messages:
@@ -137,6 +190,16 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
        if self.reasoning_parser_name:
            engine_kwargs["reasoning_parser"] = self.reasoning_parser_name

        # engine_args from YAML overrides typed fields above so operators can
        # tune anything ServerArgs exposes (speculative decoding, attention
        # backend, MoE, hierarchical cache, …) without waiting on protobuf
        # changes.
        try:
            engine_kwargs = self._apply_engine_args(engine_kwargs, request.EngineArgs)
        except ValueError as err:
            print(f"engine_args error: {err}", file=sys.stderr)
            return backend_pb2.Result(success=False, message=str(err))

        try:
            self.llm = Engine(**engine_kwargs)
        except Exception as err:
@@ -221,7 +284,7 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
            "TopP": "top_p",
            "TopK": "top_k",
            "MinP": "min_p",
-           "Seed": "seed",
+           "Seed": _SEED_KEY,
            "StopPrompts": "stop",
            "StopTokenIds": "stop_token_ids",
            "IgnoreEOS": "ignore_eos",


@@ -23,17 +23,32 @@ if [ "x${BUILD_PROFILE}" == "xcpu" ]; then
    EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi

# cublas12 needs a cu128 torch index (see requirements-cublas12.txt) — without
# unsafe-best-match uv falls through to default PyPI's cu130 torch wheel and
# the resulting sgl-kernel can't load on our cu12 host libs.
if [ "x${BUILD_PROFILE}" == "xcublas12" ]; then
    EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi

# sglang 0.5.11 (Gemma 4 support) declares flash-attn-4 as a hard dep, but
# upstream only publishes pre-release wheels (4.0.0b*). uv rejects
# pre-releases by default — opt in for sglang specifically. Drop this once
# flash-attn-4 4.0 stable lands.
EXTRA_PIP_INSTALL_FLAGS+=" --prerelease=allow"

# JetPack 7 / L4T arm64 wheels are built for cp312 and shipped via
# pypi.jetson-ai-lab.io. Bump the venv Python so the prebuilt sglang
-# wheel resolves cleanly. unsafe-best-match is required because the
-# jetson-ai-lab index lists transitive deps (e.g. decord) at older
-# versions only — without it uv refuses to fall through to PyPI for a
-# compatible wheel and resolution fails.
+# wheel resolves cleanly. The actual install on l4t13 goes through
+# pyproject.toml (see the elif branch below) so [tool.uv.sources] can
+# pin only torch/torchvision/torchaudio/sglang to the jetson-ai-lab
+# index — leaving PyPI as the path for transitive deps like
+# markdown-it-py / anthropic / propcache that the L4T mirror's proxy
+# 503s on. No --index-strategy flag here: the explicit index keeps the
+# scoping clean.
if [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
    PYTHON_VERSION="3.12"
    PYTHON_PATCH="12"
    PY_STANDALONE_TAG="20251120"
-    EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi
# sglang's CPU path has no prebuilt wheel on PyPI — upstream publishes
@@ -95,6 +110,27 @@ if [ "x${BUILD_TYPE}" == "x" ] || [ "x${FROM_SOURCE:-}" == "xtrue" ]; then
    fi
    uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} .
    popd
# L4T arm64 (JetPack 7): drive the install through pyproject.toml so that
# [tool.uv.sources] can pin torch/torchvision/torchaudio/sglang to the
# jetson-ai-lab index, while everything else (transitive deps and
# PyPI-resolvable packages like transformers / accelerate) comes from
# PyPI. Bypasses installRequirements because uv pip install -r
# requirements.txt does not honor sources — see
# backend/python/sglang/pyproject.toml for the rationale. Mirrors the
# equivalent path in backend/python/vllm/install.sh.
elif [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
    ensureVenv
    if [ "x${PORTABLE_PYTHON}" == "xtrue" ]; then
        export C_INCLUDE_PATH="${C_INCLUDE_PATH:-}:$(_portable_dir)/include/python${PYTHON_VERSION}"
    fi
    pushd "${backend_dir}"
    # Build deps first (matches installRequirements' requirements-install.txt
    # pass — sglang/sgl-kernel sdists need packaging/setuptools-scm in the
    # venv before they can build under --no-build-isolation).
    uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} -r requirements-install.txt
    uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} --requirement pyproject.toml
    popd
    runProtogen
else
    installRequirements
fi


@@ -0,0 +1,68 @@
# L4T arm64 (JetPack 7 / sbsa cu130) install spec for the sglang backend.
#
# Why this file exists, and why only the l4t13 BUILD_PROFILE consumes it:
#
# pypi.jetson-ai-lab.io hosts the L4T-specific torch / sglang / sgl-kernel
# wheels we need on aarch64 + cuda13, but it ALSO transparently proxies the
# rest of PyPI through `/+f/<sha>/<filename>` URLs that 503 frequently.
# With `--extra-index-url` + `--index-strategy=unsafe-best-match` (the
# historical fix in install.sh) uv would pick those proxy URLs for ordinary
# PyPI packages — markdown-it-py, anthropic, propcache, etc. — and trip on
# the 503s. See e.g. CI run 25439791228 (markdown-it-py-4.0.0).
#
# `explicit = true` on the index makes uv consult the L4T mirror ONLY for
# packages mapped under [tool.uv.sources]. Everything else goes to PyPI.
# This breaks the historical 503 path without losing access to the L4T
# wheels we actually need from there. Mirrors the equivalent fix already
# in backend/python/vllm/pyproject.toml.
#
# `uv pip install -r requirements.txt` does NOT honor [tool.uv.sources]
# (sources are project-mode only, not pip-compat mode), so install.sh's
# l4t13 branch invokes `uv pip install --requirement pyproject.toml`
# directly. Other BUILD_PROFILEs continue to use the requirements-*.txt
# pipeline through libbackend.sh's installRequirements and never read
# this file.
[project]
name = "localai-sglang-l4t13"
version = "0.0.0"
requires-python = ">=3.12,<3.13"
dependencies = [
    # Mirror of requirements.txt — kept in sync manually for now since the
    # l4t13 path bypasses installRequirements (see install.sh).
    "grpcio==1.80.0",
    "protobuf",
    "certifi",
    "setuptools",
    "pillow",
    # L4T-specific accelerator stack (sourced from jetson-ai-lab below).
    "torch",
    "torchvision",
    "torchaudio",
    # sglang on jetson — the [all] extra is deliberately omitted because it
    # pulls outlines/decord, and decord has no aarch64 cp312 wheel anywhere
    # (not on PyPI, and the jetson-ai-lab index ships only legacy cp35-cp37
    # wheels). With [all] uv backtracks through versions trying to satisfy
    # decord and lands on sglang==0.1.16. The 0.5.0 floor matches the only
    # major series the jetson-ai-lab sbsa/cu130 mirror currently publishes
    # (sglang==0.5.1.post2 as of 2026-05-06). Bumping to >=0.5.11 here
    # would make the build unsatisfiable until the mirror catches up.
    # Gemma 4 / MTP recipes are therefore not supported on l4t13 — those
    # features land on cublas12/cublas13 hosts that pull the newer wheel
    # from PyPI. backend.py keeps backward compat with the 0.5.x
    # SamplingParams field rename via runtime detection.
    "sglang>=0.5.0",
    # PyPI-resolvable packages that complete the runtime.
    "accelerate",
    "transformers",
]

[[tool.uv.index]]
name = "jetson-ai-lab"
url = "https://pypi.jetson-ai-lab.io/sbsa/cu130"
explicit = true

[tool.uv.sources]
torch = { index = "jetson-ai-lab" }
torchvision = { index = "jetson-ai-lab" }
torchaudio = { index = "jetson-ai-lab" }
sglang = { index = "jetson-ai-lab" }


@@ -1,3 +1,4 @@
# Bump this pin deliberately — sglang releases weekly and API surfaces
# (FunctionCallParser, ReasoningParser) move between releases.
-sglang[all]>=0.4.0
+# 0.5.11 is the floor for Gemma 4 support (PR sgl-project/sglang#21952).
+sglang[all]>=0.5.11


@@ -1,5 +1,12 @@
# sglang 0.5.11 hard-pins torch==2.9.1. PyPI's default torch 2.9.1 wheel is
# now the cu130 build, which drags in cu130-flavoured sgl-kernel/sglang-kernel
# binaries that need libnvrtc.so.13 — incompatible with our cu12 host libs.
# Pin the cu128 PyTorch index so uv pulls cu12-flavoured torch (and the
# matching sgl-kernel cu12 wheels). install.sh adds --index-strategy=unsafe-best-match
# for cublas12 so uv consults this index alongside PyPI.
--extra-index-url https://download.pytorch.org/whl/cu128
accelerate
-torch==2.7.1
+torch==2.9.1
torchvision
-torchaudio==2.7.1
+torchaudio
transformers
transformers


@@ -0,0 +1,4 @@
# Bump this pin deliberately — sglang releases weekly and API surfaces
# (FunctionCallParser, ReasoningParser) move between releases.
# 0.5.11 is the floor for Gemma 4 support (PR sgl-project/sglang#21952).
sglang[all]>=0.5.11


@@ -0,0 +1,6 @@
--extra-index-url https://download.pytorch.org/whl/cu130
accelerate
torch
torchvision
torchaudio
transformers


@@ -1,12 +0,0 @@
--extra-index-url https://pypi.jetson-ai-lab.io/sbsa/cu130
accelerate
torch
torchvision
torchaudio
transformers
# Drop the [all] extra: it pulls outlines/decord, and decord has no
# aarch64 cp312 wheel anywhere (not on PyPI, and the jetson-ai-lab index
# ships only legacy cp35-cp37 wheels). With [all] uv backtracks through
# versions trying to satisfy decord and lands on sglang==0.1.16. Floor
# at 0.5.0 so uv can't silently downgrade if a future resolution misfires.
sglang>=0.5.0


@@ -0,0 +1,101 @@
"""Unit tests for the sglang backend.

Helper-level tests run without launching the gRPC server or loading model
weights — they only exercise the pure-Python helpers on
``BackendServicer``. They do still require ``sglang`` to be importable
because ``_apply_engine_args`` validates keys against
``ServerArgs``'s dataclass fields.
"""
import unittest


class TestSglangHelpers(unittest.TestCase):
    """Tests for the pure helpers on BackendServicer (no gRPC, no engine)."""

    def _servicer(self):
        import sys
        import os
        sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
        from backend import BackendServicer  # noqa: E402
        return BackendServicer()

    def test_parse_options(self):
        servicer = self._servicer()
        opts = servicer._parse_options([
            "tool_parser:hermes",
            "reasoning_parser:deepseek_r1",
            "invalid_no_colon",
            "key_with_colons:a:b:c",
        ])
        self.assertEqual(opts["tool_parser"], "hermes")
        self.assertEqual(opts["reasoning_parser"], "deepseek_r1")
        self.assertEqual(opts["key_with_colons"], "a:b:c")
        self.assertNotIn("invalid_no_colon", opts)

    def test_apply_engine_args_known_keys(self):
        """User-supplied JSON merges into the kwargs dict; pre-set typed
        fields stay put when not overridden."""
        import json as _json
        servicer = self._servicer()
        base = {
            "model_path": "facebook/opt-125m",
            "mem_fraction_static": 0.7,
        }
        extras = _json.dumps({
            "trust_remote_code": True,
            "speculative_algorithm": "EAGLE",
            "speculative_num_steps": 1,
        })
        out = servicer._apply_engine_args(base, extras)
        self.assertIs(out, base)  # in-place merge — same dict back
        self.assertTrue(out["trust_remote_code"])
        self.assertEqual(out["speculative_algorithm"], "EAGLE")
        self.assertEqual(out["speculative_num_steps"], 1)
        self.assertEqual(out["model_path"], "facebook/opt-125m")
        self.assertEqual(out["mem_fraction_static"], 0.7)

    def test_apply_engine_args_engine_args_overrides_typed_fields(self):
        """engine_args wins over previously-set typed kwargs (vLLM precedence)."""
        import json as _json
        servicer = self._servicer()
        base = {"model_path": "facebook/opt-125m", "mem_fraction_static": 0.7}
        out = servicer._apply_engine_args(
            base, _json.dumps({"mem_fraction_static": 0.5}),
        )
        self.assertEqual(out["mem_fraction_static"], 0.5)

    def test_apply_engine_args_unknown_key_raises(self):
        """Typo'd key raises ValueError with a close-match suggestion."""
        import json as _json
        servicer = self._servicer()
        base = {"model_path": "facebook/opt-125m"}
        with self.assertRaises(ValueError) as ctx:
            servicer._apply_engine_args(
                base, _json.dumps({"trust_remotecode": True}),
            )
        msg = str(ctx.exception)
        self.assertIn("trust_remotecode", msg)
        self.assertIn("trust_remote_code", msg)

    def test_apply_engine_args_empty_passthrough(self):
        """Empty / None engine_args returns the kwargs dict untouched."""
        servicer = self._servicer()
        base = {"model_path": "facebook/opt-125m"}
        self.assertIs(servicer._apply_engine_args(base, ""), base)
        self.assertIs(servicer._apply_engine_args(base, None), base)

    def test_apply_engine_args_invalid_json_raises(self):
        servicer = self._servicer()
        with self.assertRaises(ValueError) as ctx:
            servicer._apply_engine_args({}, "not-json")
        self.assertIn("not valid JSON", str(ctx.exception))

    def test_apply_engine_args_non_object_raises(self):
        servicer = self._servicer()
        with self.assertRaises(ValueError) as ctx:
            servicer._apply_engine_args({}, "[1,2,3]")
        self.assertIn("must be a JSON object", str(ctx.exception))


if __name__ == "__main__":
    unittest.main()

backend/python/sglang/test.sh (new executable file, 12 lines)

@@ -0,0 +1,12 @@
#!/bin/bash
set -e

backend_dir=$(dirname $0)

if [ -d $backend_dir/common ]; then
    source $backend_dir/common/libbackend.sh
else
    source $backend_dir/../common/libbackend.sh
fi

runUnittests