Mirror of https://github.com/mudler/LocalAI.git (synced 2026-05-16 12:38:01 -04:00)
Bring the sglang Python backend up to feature parity with vllm by adding
the same engine_args map plumbing the vLLM backend already has. Any
ServerArgs field (~380 in sglang 0.5.11) becomes settable from a model
YAML, including the speculative-decoding flags needed for Multi-Token
Prediction. Validation matches the vllm backend's: keys are checked
against dataclasses.fields(ServerArgs), unknown keys raise ValueError
with a difflib close-match suggestion at LoadModel time, and the typed
ModelOptions fields keep their existing meaning with engine_args
overriding them.
Backend code:
* backend/python/sglang/backend.py: add _apply_engine_args, import
dataclasses/difflib/ServerArgs, call from LoadModel; rename Seed ->
sampling_seed (sglang 0.5.11 renamed the SamplingParams field).
* backend/python/sglang/test.py + test.sh + Makefile: six unit tests
exercising the helper directly (no engine load required).
Build / CI / backend gallery (cuda13 + l4t13 paths are now first-class):
* backend/python/sglang/install.sh: add --prerelease=allow because
sglang 0.5.11 hard-pins flash-attn-4 which only ships beta wheels;
add --index-strategy=unsafe-best-match for cublas12 so the cu128
torch index wins over default-PyPI's cu130; new pyproject.toml-driven
l4t13 install path so [tool.uv.sources] can pin torch/torchvision/
torchaudio/sglang to the jetson-ai-lab index without forcing every
transitive PyPI dep through the L4T mirror's flaky proxy (mirrors the
equivalent fix in backend/python/vllm/install.sh).
* backend/python/sglang/pyproject.toml (new): L4T project spec with
explicit-source jetson-ai-lab index. Replaces requirements-l4t13.txt
for the l4t13 BUILD_PROFILE; other profiles still go through the
requirements-*.txt pipeline via libbackend.sh's installRequirements.
* backend/python/sglang/requirements-l4t13.txt: removed; superseded
by pyproject.toml.
* backend/python/sglang/requirements-cublas{12,13}{,-after}.txt: pin
sglang>=0.5.11 (Gemma 4 floor); add cu130 torch index for cublas13
(new files) and cu128 torch index for cublas12 (PyPI now ships cu130
torch wheels by default, which breaks cu12 hosts).
* backend/index.yaml: add cuda13-sglang and cuda13-sglang-development
capability mappings + image entries pointing at
quay.io/.../-gpu-nvidia-cuda-13-sglang.
* .github/workflows/backend.yml: new cublas13 sglang matrix entry,
mirroring vllm's cuda13 build.
Model gallery + docs:
* gallery/sglang.yaml: base sglang config template, mirrors vllm.yaml.
* gallery/sglang-gemma-4-{e2b,e4b}-mtp.yaml: Gemma 4 MTP demos
transcribed verbatim from the SGLang Gemma 4 cookbook MTP commands.
* gallery/sglang-mimo-7b-mtp.yaml: MiMo-7B-RL with built-in MTP heads
+ online fp8 weight quantization, verified end-to-end on a 16 GB
RTX 5070 Ti at ~88 tok/s. Uses mem_fraction_static: 0.7 because the
MTP draft worker's vocab embedding is loaded unquantised and OOMs
the static reservation at sglang's 0.85 default.
* gallery/index.yaml: three new entries (gemma-4-e2b-it:sglang-mtp,
gemma-4-e4b-it:sglang-mtp, mimo-7b-mtp:sglang).
* docs/content/features/text-generation.md: new SGLang section with
setup, engine_args reference, MTP demos, version requirements.
* .agents/sglang-backend.md (new): agent one-pager covering the flat
ServerArgs structure, the typed-vs-engine_args precedence, the
speculative-decoding cheatsheet, and the mem_fraction_static gotcha
documented above.
* AGENTS.md: index entry for the new agent doc.
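Put together, a model YAML using this plumbing might look like the sketch below. This is hypothetical, not a copy of the shipped gallery files: the model reference and field spellings follow the commit text, and the speculative-decoding flags are deliberately left as a placeholder since the commit does not name them.

```yaml
name: mimo-7b-mtp
backend: sglang
parameters:
  model: XiaomiMiMo/MiMo-7B-RL
engine_args:
  # Any ServerArgs field is legal here; typed ModelOptions fields are
  # overridden when both are set.
  quantization: fp8
  mem_fraction_static: 0.7  # MTP draft worker OOMs at sglang's 0.85 default
  # ...plus the speculative-decoding ServerArgs flags needed for MTP.
```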
Known limitation: the two Gemma 4 MTP gallery entries ship a recipe
that doesn't yet run on stock libraries. The drafter checkpoints
(google/gemma-4-{E2B,E4B}-it-assistant) declare
model_type: gemma4_assistant / Gemma4AssistantForCausalLM, which
neither transformers (<=5.6.0, including the SGLang cookbook's pinned
commit 91b1ab1f... and main HEAD) nor sglang's own model registry
(<=0.5.11) registers as of 2026-05-06. They will start working when
HF or sglang upstream registers the architecture -- no LocalAI
changes needed. The MiMo MTP demo and the non-MTP Gemma 4 paths work
today on this build (verified on RTX 5070 Ti, 16 GB).
Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash] [WebFetch] [WebSearch]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
backend/python/sglang/pyproject.toml (69 lines, 3.0 KiB, TOML):
# L4T arm64 (JetPack 7 / sbsa cu130) install spec for the sglang backend.
#
# Why this file exists, and why only the l4t13 BUILD_PROFILE consumes it:
#
# pypi.jetson-ai-lab.io hosts the L4T-specific torch / sglang / sgl-kernel
# wheels we need on aarch64 + cuda13, but it ALSO transparently proxies the
# rest of PyPI through `/+f/<sha>/<filename>` URLs that 503 frequently.
# With `--extra-index-url` + `--index-strategy=unsafe-best-match` (the
# historical fix in install.sh) uv would pick those proxy URLs for ordinary
# PyPI packages — markdown-it-py, anthropic, propcache, etc. — and trip on
# the 503s. See e.g. CI run 25439791228 (markdown-it-py-4.0.0).
#
# `explicit = true` on the index makes uv consult the L4T mirror ONLY for
# packages mapped under [tool.uv.sources]. Everything else goes to PyPI.
# This breaks the historical 503 path without losing access to the L4T
# wheels we actually need from there. Mirrors the equivalent fix already
# in backend/python/vllm/pyproject.toml.
#
# `uv pip install -r requirements.txt` does NOT honor [tool.uv.sources]
# (sources are project-mode only, not pip-compat mode), so install.sh's
# l4t13 branch invokes `uv pip install --requirement pyproject.toml`
# directly. Other BUILD_PROFILEs continue to use the requirements-*.txt
# pipeline through libbackend.sh's installRequirements and never read
# this file.

[project]
name = "localai-sglang-l4t13"
version = "0.0.0"
requires-python = ">=3.12,<3.13"
dependencies = [
    # Mirror of requirements.txt — kept in sync manually for now since the
    # l4t13 path bypasses installRequirements (see install.sh).
    "grpcio==1.80.0",
    "protobuf",
    "certifi",
    "setuptools",
    "pillow",
    # L4T-specific accelerator stack (sourced from jetson-ai-lab below).
    "torch",
    "torchvision",
    "torchaudio",
    # sglang on jetson — the [all] extra is deliberately omitted because it
    # pulls outlines/decord, and decord has no aarch64 cp312 wheel anywhere
    # (neither on PyPI nor on the jetson-ai-lab index, which ships only
    # legacy cp35-cp37 builds). With [all] uv backtracks through versions
    # trying to satisfy decord and lands on sglang==0.1.16. The 0.5.0 floor
    # matches the only major series the jetson-ai-lab sbsa/cu130 mirror
    # currently publishes (sglang==0.5.1.post2 as of 2026-05-06). Bumping
    # to >=0.5.11 here would make the build unsatisfiable until the mirror
    # catches up. Gemma 4 / MTP recipes are therefore not supported on
    # l4t13 — those features land on cublas12/cublas13 hosts that pull the
    # newer wheel from PyPI. backend.py keeps backward compat with the
    # 0.5.x SamplingParams field rename via runtime detection.
    "sglang>=0.5.0",
    # PyPI-resolvable packages that complete the runtime.
    "accelerate",
    "transformers",
]

[[tool.uv.index]]
name = "jetson-ai-lab"
url = "https://pypi.jetson-ai-lab.io/sbsa/cu130"
explicit = true

[tool.uv.sources]
torch = { index = "jetson-ai-lab" }
torchvision = { index = "jetson-ai-lab" }
torchaudio = { index = "jetson-ai-lab" }
sglang = { index = "jetson-ai-lab" }