feat(sglang): wire engine_args, add cuda13 build, ship MTP gallery demos (#9686)

Bring the sglang Python backend up to feature parity with vllm by adding
the same engine_args:-map plumbing the vLLM backend already has. Any
ServerArgs field (~380 in sglang 0.5.11) becomes settable from a model
YAML, including the speculative-decoding flags needed for Multi-Token
Prediction. Validation matches the vllm backend's: keys are checked
against dataclasses.fields(ServerArgs), unknown keys raise ValueError
with a difflib close-match suggestion at LoadModel time, and the typed
ModelOptions fields keep their existing meaning with engine_args
overriding them.
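
As an illustration of the new plumbing, a model YAML can now carry any
ServerArgs field under engine_args: (the model name below is only an
example; the flag values mirror the MiMo MTP demo added in this PR):

```yaml
name: my-sglang-model          # illustrative name
backend: sglang
parameters:
  model: XiaomiMiMo/MiMo-7B-RL
engine_args:
  quantization: fp8
  mem_fraction_static: 0.7
  speculative_algorithm: EAGLE
  speculative_num_steps: 1
  speculative_eagle_topk: 1
  speculative_num_draft_tokens: 2
```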

Backend code:
* backend/python/sglang/backend.py: add _apply_engine_args, import
  dataclasses/difflib/ServerArgs, call from LoadModel; rename Seed ->
  sampling_seed (sglang 0.5.11 renamed the SamplingParams field).
* backend/python/sglang/test.py + test.sh + Makefile: six unit tests
  exercising the helper directly (no engine load required).

Build / CI / backend gallery (cuda13 + l4t13 paths are now first-class):
* backend/python/sglang/install.sh: add --prerelease=allow because
  sglang 0.5.11 hard-pins flash-attn-4 which only ships beta wheels;
  add --index-strategy=unsafe-best-match for cublas12 so the cu128
  torch index wins over default-PyPI's cu130; new pyproject.toml-driven
  l4t13 install path so [tool.uv.sources] can pin torch/torchvision/
  torchaudio/sglang to the jetson-ai-lab index without forcing every
  transitive PyPI dep through the L4T mirror's flaky proxy (mirrors the
  equivalent fix in backend/python/vllm/install.sh).
* backend/python/sglang/pyproject.toml (new): L4T project spec with
  explicit-source jetson-ai-lab index. Replaces requirements-l4t13.txt
  for the l4t13 BUILD_PROFILE; other profiles still go through the
  requirements-*.txt pipeline via libbackend.sh's installRequirements.
* backend/python/sglang/requirements-l4t13.txt: removed; superseded
  by pyproject.toml.
* backend/python/sglang/requirements-cublas{12,13}{,-after}.txt: pin
  sglang>=0.5.11 (Gemma 4 floor); add cu130 torch index for cublas13
  (new files) and cu128 torch index for cublas12 (PyPI's default torch
  wheel is now the cu130 build, which breaks cu12 hosts).
* backend/index.yaml: add cuda13-sglang and cuda13-sglang-development
  capability mappings + image entries pointing at
  quay.io/.../-gpu-nvidia-cuda-13-sglang.
* .github/workflows/backend.yml: new cublas13 sglang matrix entry,
  mirroring vllm's cuda13 build.

Model gallery + docs:
* gallery/sglang.yaml: base sglang config template, mirrors vllm.yaml.
* gallery/sglang-gemma-4-{e2b,e4b}-mtp.yaml: Gemma 4 MTP demos
  transcribed verbatim from the SGLang Gemma 4 cookbook MTP commands.
* gallery/sglang-mimo-7b-mtp.yaml: MiMo-7B-RL with built-in MTP heads
  + online fp8 weight quantization, verified end-to-end on a 16 GB
  RTX 5070 Ti at ~88 tok/s. Uses mem_fraction_static: 0.7 because the
  MTP draft worker's vocab embedding is loaded unquantised and OOMs
  the static reservation at sglang's 0.85 default.
* gallery/index.yaml: three new entries (gemma-4-e2b-it:sglang-mtp,
  gemma-4-e4b-it:sglang-mtp, mimo-7b-mtp:sglang).
* docs/content/features/text-generation.md: new SGLang section with
  setup, engine_args reference, MTP demos, version requirements.
* .agents/sglang-backend.md (new): agent one-pager covering the flat
  ServerArgs structure, the typed-vs-engine_args precedence, the
  speculative-decoding cheatsheet, and the mem_fraction_static gotcha
  documented above.
* AGENTS.md: index entry for the new agent doc.

Known limitation: the two Gemma 4 MTP gallery entries ship a recipe
that doesn't yet run on stock libraries. The drafter checkpoints
(google/gemma-4-{E2B,E4B}-it-assistant) declare
model_type: gemma4_assistant / Gemma4AssistantForCausalLM, which
neither transformers (<=5.6.0, including the SGLang cookbook's pinned
commit 91b1ab1f... and main HEAD) nor sglang's own model registry
(<=0.5.11) registers as of 2026-05-06. They will start working when
HF or sglang upstream registers the architecture -- no LocalAI
changes needed. The MiMo MTP demo and the non-MTP Gemma 4 paths work
today on this build (verified on RTX 5070 Ti, 16 GB).

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash] [WebFetch] [WebSearch]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
Richard Palethorpe
2026-05-07 16:27:29 +01:00
committed by GitHub
parent 048daa0cdc
commit c894d9c826
21 changed files with 732 additions and 21 deletions

.agents/sglang-backend.md (new file, 62 lines)

@@ -0,0 +1,62 @@
# Working on the SGLang Backend
The SGLang backend lives at `backend/python/sglang/backend.py` (async gRPC). It wraps SGLang's `Engine` (`sglang.srt.entrypoints.engine.Engine`) and translates LocalAI's gRPC `PredictOptions` into SGLang sampling params + outputs into `Reply.chat_deltas`. Structurally it mirrors `backend/python/vllm/backend.py` — keep them shaped the same so changes in one have an obvious analog in the other.
## `engine_args` is the universal escape hatch
A small fixed set of fields on `ModelOptions` is mapped to typed SGLang kwargs in `LoadModel` (model, quantization, load_format, gpu_memory_utilization → mem_fraction_static, trust_remote_code, enforce_eager → disable_cuda_graph, tensor_parallel_size → tp_size, max_model_len → context_length, dtype). **Everything else** flows through the `engine_args:` YAML map.
Validation happens in `_apply_engine_args`. Keys are checked against `dataclasses.fields(ServerArgs)` (`sglang.srt.server_args.ServerArgs` is a flat `@dataclass` with ~380 fields). Unknown keys raise `ValueError` at LoadModel time with a `difflib.get_close_matches` suggestion — same shape as the vLLM backend.
**Precedence:** typed `ModelOptions` fields populate `engine_kwargs` first, then `engine_args` overrides them. So a YAML that sets both `gpu_memory_utilization: 0.9` and `engine_args.mem_fraction_static: 0.5` ends up at `0.5`. Document this when answering "why didn't my YAML field stick?".
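A minimal YAML sketch of that precedence (the model name is illustrative):

```yaml
backend: sglang
parameters:
  model: Qwen/Qwen3-4B
gpu_memory_utilization: 0.9   # typed field, seeds mem_fraction_static=0.9 first
engine_args:
  mem_fraction_static: 0.5    # engine_args overrides it, so the engine sees 0.5
```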
**ServerArgs is flat.** Unlike vLLM, where speculative decoding is nested under `engine_args.speculative_config: {...}`, SGLang exposes flat top-level fields: `speculative_algorithm`, `speculative_draft_model_path`, `speculative_num_steps`, `speculative_eagle_topk`, `speculative_num_draft_tokens`, `speculative_dflash_block_size`, etc. There is no `speculative_config:` dict. Same goes for compilation, kv-transfer, attention — all flat.
The canonical reference is `python/sglang/srt/server_args.py:ServerArgs` (line ~304). When SGLang adds new flags, no LocalAI code change is needed — they're automatically available via `engine_args:`. The validator picks them up because it introspects the live dataclass.
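Because of that flatness, a speculative-decoding block in a model YAML is just a set of top-level keys (sketch only; the drafter and step counts mirror the commented EAGLE3 example in `gallery/sglang.yaml` and are not a tuned recipe):

```yaml
engine_args:
  speculative_algorithm: EAGLE3
  speculative_draft_model_path: lmsys/SGLang-EAGLE3-Llama-3.1-8B-Instruct-SpecForge
  speculative_num_steps: 3
  speculative_eagle_topk: 4
  speculative_num_draft_tokens: 16
  # Note: there is no nested speculative_config: {...} key; that nesting is vLLM's shape, not SGLang's.
```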
## Speculative decoding cheatsheet
`--speculative-algorithm` accepts `EAGLE`, `EAGLE3`, `NEXTN`, `STANDALONE`, `NGRAM`, `DFLASH`. `NEXTN` is silently rewritten to `EAGLE` in `ServerArgs.__post_init__` (`server_args.py:3286-3287`). MTP (Multi-Token Prediction) is the same EAGLE path with `num_steps=1, eagle_topk=1, num_draft_tokens=2` against a target whose architecture has multi-token heads (e.g. MiMo-7B-RL, DeepSeek-V3-MTP).
| Algorithm | Drafter requirement | Gallery demo target | Gallery demo drafter |
|-----------|--------------------|---------------------|----------------------|
| `NEXTN` / `EAGLE` (MTP) | Assistant drafter or built-in heads | google/gemma-4-E2B-it, google/gemma-4-E4B-it | google/gemma-4-E2B-it-assistant, google/gemma-4-E4B-it-assistant |
| `EAGLE3` | EAGLE3 draft head | (no gallery entry yet) | e.g. jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B |
| `DFLASH` | Block-diffusion drafter | (no gallery entry yet) | e.g. z-lab/Qwen3-4B-DFlash-b16 |
| `STANDALONE` | Smaller LLM as drafter | (no gallery entry yet) | any smaller chat-tuned LLM in the same family |
| `NGRAM` | None — uses prefix history | (no gallery entry yet) | n/a |
The Gemma 4 demos use `mem_fraction_static: 0.85` (cookbook default) and the cookbook's `num_steps=5, num_draft_tokens=6, eagle_topk=1` parameters. Other algorithms are reachable from any user YAML via `engine_args:` but don't have shipped demos yet — that's a deliberate gallery scope choice, not a backend limitation.
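For the built-in-heads MTP shape, the whole speculative block reduces to a short `engine_args:` map; this sketch is the core of `gallery/sglang-mimo-7b-mtp.yaml`:

```yaml
engine_args:
  quantization: fp8             # optional online weight quantization
  mem_fraction_static: 0.7      # see the gotcha section below
  speculative_algorithm: EAGLE  # NEXTN would be rewritten to EAGLE anyway
  speculative_num_steps: 1
  speculative_eagle_topk: 1
  speculative_num_draft_tokens: 2
```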
Gemma 4 support requires sglang built from a commit that includes [PR #21952](https://github.com/sgl-project/sglang/pull/21952). LocalAI's pinned release for cublas12 / cublas13 includes it. The `l4t13` (JetPack 7 / sbsa cu130) build floors at `sglang>=0.5.0` because the `pypi.jetson-ai-lab.io` mirror still ships only `0.5.1.post2` as of 2026-05-06 — Gemma 4 / MTP recipes are therefore not available on l4t13 until that mirror catches up. `backend.py` keeps backward compat across the 0.5.x → 0.5.11 rename of `SamplingParams.seed` to `sampling_seed` via runtime detection.
Compatibility caveats per the SGLang docs: DFLASH and NGRAM are incompatible with `enable_dp_attention`; DFLASH requires `pp_size == 1`; STANDALONE is incompatible with `enable_dp_attention`; NGRAM is CUDA-only and disables the overlap scheduler.
### `mem_fraction_static` + quantization + MTP on consumer GPUs
When combining online weight quantization (`engine_args.quantization: fp8` / `awq` / etc.) with built-in-head MTP (`speculative_algorithm: EAGLE`/`NEXTN`) on a tight VRAM budget, sglang's default `mem_fraction_static: 0.85` will OOM during draft-worker init. The reason: sglang quantizes the **target** model's transformer blocks but loads the **MTP draft worker's vocab embedding** at the source dtype (typically bf16). For a 7 B-class model with a 150k-token vocab × 4096 hidden, that's another ~1.2 GiB allocated *after* the static pool is reserved. At 0.85 fraction on a 16 GB card there's no room left.
Workaround: drop `mem_fraction_static` to ~0.7 so the post-static heap can absorb the MTP embedding alloc + CUDA graph private pools. Verified end-to-end on MiMo-7B-RL + fp8 + MTP on a 16 GB RTX 5070 Ti (`gallery/sglang-mimo-7b-mtp.yaml`) at ~88 tok/s. Models with larger vocabs or more MTP layers (e.g. DeepSeek-V3-MTP) need an even smaller fraction.
This isn't documented anywhere upstream as of 2026-05-06 — the SGLang Gemma 4 cookbook uses 0.85 because their MTP path doesn't go through `eagle_worker_v2.py` for an embedding-bearing draft module. Don't blanket-apply 0.7 across all sglang YAMLs; only when MTP-with-built-in-heads + quantization combine.
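A quick back-of-envelope check of the embedding figure quoted above (sizes are the ones from the note; bf16 assumed at 2 bytes per parameter):

```python
# Rough size of the MTP draft worker's unquantised vocab embedding.
vocab_size = 150_000   # ~150k tokens (MiMo-7B-RL is ~152k)
hidden_size = 4096
bytes_per_param = 2    # bf16
gib = vocab_size * hidden_size * bytes_per_param / 2**30
print(f"~{gib:.2f} GiB")  # ~1.14 GiB, i.e. the "~1.2 GiB" allocated after the static pool is reserved
```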
## Tool-call and reasoning parsers stay on `Options[]`
ServerArgs has `tool_call_parser` and `reasoning_parser` fields, and the backend does pass them through to `Engine` so SGLang's own HTTP/OAI surface keeps working. But for the **LocalAI** request path the backend constructs fresh per-request parser instances in `_make_parsers` (`backend.py:286`) because the parsers are stateful — the streaming and non-streaming paths each need their own.
So the user-facing knob stays on `Options[]`:
```yaml
options:
- tool_parser:hermes
- reasoning_parser:deepseek_r1
```
Putting these in `engine_args:` will set them on `ServerArgs` but the LocalAI-level streaming `ChatDelta` will not pick them up. Don't recommend that path.
## What's missing today (out of scope, but worth tracking)
- `core/config/hooks_sglang.go` — there is no SGLang equivalent of `hooks_vllm.go`. The vLLM hook auto-selects parsers for known model families from `parser_defaults.json` and seeds production engine_args defaults. A symmetric hook for SGLang could reuse the same `parser_defaults.json` (the SGLang parser names are different but the family detection is shared) and seed defaults like `enable_metrics: true` or attention-backend choices.
- `core/gallery/importers/sglang.go` — vLLM has an importer that resolves model architecture → parser defaults at gallery-import time. A matching importer for SGLang would let `local-ai install` populate sensible parsers automatically.
These should be a follow-up PR, not a blocker for the engine_args feature.


@@ -959,6 +959,19 @@ jobs:
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13-sglang'
runs-on: 'arc-runner-set'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "sglang"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"


@@ -24,6 +24,7 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
| [.agents/coding-style.md](.agents/coding-style.md) | Code style, editorconfig, logging, documentation conventions |
| [.agents/llama-cpp-backend.md](.agents/llama-cpp-backend.md) | Working on the llama.cpp backend — architecture, updating, tool call parsing |
| [.agents/vllm-backend.md](.agents/vllm-backend.md) | Working on the vLLM / vLLM-omni backends — native parsers, ChatDelta, CPU build, libnuma packaging, backend hooks |
| [.agents/sglang-backend.md](.agents/sglang-backend.md) | Working on the SGLang backend — `engine_args` validation against ServerArgs, speculative-decoding (EAGLE/EAGLE3/DFLASH/MTP) recipes, parser handling |
| [.agents/testing-mcp-apps.md](.agents/testing-mcp-apps.md) | Testing MCP Apps (interactive tool UIs) in the React UI |
| [.agents/api-endpoints-and-auth.md](.agents/api-endpoints-and-auth.md) | Adding API endpoints, auth middleware, feature permissions, user access control |
| [.agents/debugging-backends.md](.agents/debugging-backends.md) | Debugging runtime backend failures, dependency conflicts, rebuilding backends |


@@ -287,6 +287,7 @@
amd: "rocm-sglang"
intel: "intel-sglang"
nvidia-cuda-12: "cuda12-sglang"
nvidia-cuda-13: "cuda13-sglang"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-sglang"
cpu: "cpu-sglang"
- &vllm-omni
@@ -1965,6 +1966,7 @@
amd: "rocm-sglang-development"
intel: "intel-sglang-development"
nvidia-cuda-12: "cuda12-sglang-development"
nvidia-cuda-13: "cuda13-sglang-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-sglang-development"
cpu: "cpu-sglang-development"
- !!merge <<: *sglang
@@ -1972,6 +1974,11 @@
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-sglang"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-sglang
- !!merge <<: *sglang
name: "cuda13-sglang"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-sglang"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-sglang
- !!merge <<: *sglang
name: "cuda13-nvidia-l4t-arm64-sglang"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-sglang"
@@ -1997,6 +2004,11 @@
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-sglang"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-sglang
- !!merge <<: *sglang
name: "cuda13-sglang-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-sglang"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-sglang
- !!merge <<: *sglang
name: "cuda13-nvidia-l4t-arm64-sglang-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-sglang"


@@ -8,6 +8,12 @@ run: sglang
bash run.sh
@echo "sglang run."
.PHONY: test
test: sglang
@echo "Testing sglang..."
bash test.sh
@echo "sglang tested."
.PHONY: protogen-clean
protogen-clean:
$(RM) backend_pb2_grpc.py backend_pb2.py


@@ -9,10 +9,18 @@ The streaming path applies sglang's per-request FunctionCallParser and
ReasoningParser so tool_calls and reasoning_content are emitted
incrementally inside ChatDelta, which is a capability sglang exposes
natively and vLLM does not.
Like the vLLM backend, this one accepts an arbitrary ``engine_args:``
map in the model YAML; keys are validated against ``ServerArgs`` fields
and forwarded to ``Engine(**kwargs)``. That covers speculative decoding
(EAGLE/EAGLE3/DFLASH/NGRAM/STANDALONE plus MTP via NEXTN), attention
backend selection, MoE knobs, hierarchical cache, and so on.
"""
import asyncio
from concurrent import futures
import argparse
import dataclasses
import difflib
import signal
import sys
import os
@@ -38,6 +46,7 @@ from grpc_auth import get_auth_interceptors
# are wrapped in try/except so older / leaner installs that omit them
# still load the backend for plain text generation.
from sglang.srt.entrypoints.engine import Engine
from sglang.srt.server_args import ServerArgs
try:
from sglang.srt.function_call.function_call_parser import FunctionCallParser
@@ -66,6 +75,19 @@ except Exception:
HAS_TRANSFORMERS = False
# sglang 0.5.11 renamed SamplingParams.seed -> sampling_seed (PR #21952).
# Earlier 0.5.x releases (e.g. 0.5.1.post2 — the wheel still pinned by the
# pypi.jetson-ai-lab.io sbsa/cu130 mirror used by the l4t13 build profile)
# accept only `seed`. Detect the supported keyword once at import time so
# both versions work without a hard pin floor.
try:
import inspect as _inspect
from sglang.srt.sampling.sampling_params import SamplingParams as _SamplingParams
_SEED_KEY = "sampling_seed" if "sampling_seed" in _inspect.signature(_SamplingParams).parameters else "seed"
except Exception:
_SEED_KEY = "sampling_seed"
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
MAX_WORKERS = int(os.environ.get('PYTHON_GRPC_MAX_WORKERS', '1'))
@@ -82,6 +104,37 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
opts[key.strip()] = value.strip()
return opts
def _apply_engine_args(self, engine_kwargs: dict, engine_args_json: str) -> dict:
"""Merge user-supplied engine_args (JSON object) into the kwargs dict
that will be forwarded to ``sglang.Engine`` (which constructs a
``ServerArgs`` from them).
Mirrors ``backend/python/vllm/backend.py::_apply_engine_args`` but
operates on the kwargs dict because sglang's ``Engine.__init__``
accepts ``**kwargs`` directly rather than a pre-built dataclass.
Validation happens against ``ServerArgs`` fields so a typo fails
early with a close-match suggestion instead of producing a confusing
``TypeError`` deep inside engine startup.
"""
if not engine_args_json:
return engine_kwargs
try:
extra = json.loads(engine_args_json)
except json.JSONDecodeError as e:
raise ValueError(f"engine_args is not valid JSON: {e}") from e
if not isinstance(extra, dict):
raise ValueError(
f"engine_args must be a JSON object, got {type(extra).__name__}"
)
valid = {f.name for f in dataclasses.fields(ServerArgs)}
for key in extra:
if key not in valid:
suggestion = difflib.get_close_matches(key, valid, n=1)
hint = f" did you mean {suggestion[0]!r}?" if suggestion else ""
raise ValueError(f"unknown engine_args key {key!r}.{hint}")
engine_kwargs.update(extra)
return engine_kwargs
def _messages_to_dicts(self, messages) -> List[dict]:
result: List[dict] = []
for msg in messages:
@@ -137,6 +190,16 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
if self.reasoning_parser_name:
engine_kwargs["reasoning_parser"] = self.reasoning_parser_name
# engine_args from YAML overrides typed fields above so operators can
# tune anything ServerArgs exposes (speculative decoding, attention
# backend, MoE, hierarchical cache, …) without waiting on protobuf
# changes.
try:
engine_kwargs = self._apply_engine_args(engine_kwargs, request.EngineArgs)
except ValueError as err:
print(f"engine_args error: {err}", file=sys.stderr)
return backend_pb2.Result(success=False, message=str(err))
try:
self.llm = Engine(**engine_kwargs)
except Exception as err:
@@ -221,7 +284,7 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
"TopP": "top_p",
"TopK": "top_k",
"MinP": "min_p",
"Seed": "seed",
"Seed": _SEED_KEY,
"StopPrompts": "stop",
"StopTokenIds": "stop_token_ids",
"IgnoreEOS": "ignore_eos",


@@ -23,17 +23,32 @@ if [ "x${BUILD_PROFILE}" == "xcpu" ]; then
EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi
# cublas12 needs a cu128 torch index (see requirements-cublas12.txt) — without
# unsafe-best-match uv falls through to default PyPI's cu130 torch wheel and
# the resulting sgl-kernel can't load on our cu12 host libs.
if [ "x${BUILD_PROFILE}" == "xcublas12" ]; then
EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi
# sglang 0.5.11 (Gemma 4 support) declares flash-attn-4 as a hard dep, but
# upstream only publishes pre-release wheels (4.0.0b*). uv rejects
# pre-releases by default — opt in for sglang specifically. Drop this once
# flash-attn-4 4.0 stable lands.
EXTRA_PIP_INSTALL_FLAGS+=" --prerelease=allow"
# JetPack 7 / L4T arm64 wheels are built for cp312 and shipped via
# pypi.jetson-ai-lab.io. Bump the venv Python so the prebuilt sglang
# wheel resolves cleanly. unsafe-best-match is required because the
# jetson-ai-lab index lists transitive deps (e.g. decord) at older
# versions only — without it uv refuses to fall through to PyPI for a
# compatible wheel and resolution fails.
# wheel resolves cleanly. The actual install on l4t13 goes through
# pyproject.toml (see the elif branch below) so [tool.uv.sources] can
# pin only torch/torchvision/torchaudio/sglang to the jetson-ai-lab
# index — leaving PyPI as the path for transitive deps like
# markdown-it-py / anthropic / propcache that the L4T mirror's proxy
# 503s on. No --index-strategy flag here: the explicit index keeps the
# scoping clean.
if [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
PYTHON_VERSION="3.12"
PYTHON_PATCH="12"
PY_STANDALONE_TAG="20251120"
EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi
# sglang's CPU path has no prebuilt wheel on PyPI — upstream publishes
@@ -95,6 +110,27 @@ if [ "x${BUILD_TYPE}" == "x" ] || [ "x${FROM_SOURCE:-}" == "xtrue" ]; then
fi
uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} .
popd
# L4T arm64 (JetPack 7): drive the install through pyproject.toml so that
# [tool.uv.sources] can pin torch/torchvision/torchaudio/sglang to the
# jetson-ai-lab index, while everything else (transitive deps and
# PyPI-resolvable packages like transformers / accelerate) comes from
# PyPI. Bypasses installRequirements because uv pip install -r
# requirements.txt does not honor sources — see
# backend/python/sglang/pyproject.toml for the rationale. Mirrors the
# equivalent path in backend/python/vllm/install.sh.
elif [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
ensureVenv
if [ "x${PORTABLE_PYTHON}" == "xtrue" ]; then
export C_INCLUDE_PATH="${C_INCLUDE_PATH:-}:$(_portable_dir)/include/python${PYTHON_VERSION}"
fi
pushd "${backend_dir}"
# Build deps first (matches installRequirements' requirements-install.txt
# pass — sglang/sgl-kernel sdists need packaging/setuptools-scm in the
# venv before they can build under --no-build-isolation).
uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} -r requirements-install.txt
uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} --requirement pyproject.toml
popd
runProtogen
else
installRequirements
fi


@@ -0,0 +1,68 @@
# L4T arm64 (JetPack 7 / sbsa cu130) install spec for the sglang backend.
#
# Why this file exists, and why only the l4t13 BUILD_PROFILE consumes it:
#
# pypi.jetson-ai-lab.io hosts the L4T-specific torch / sglang / sgl-kernel
# wheels we need on aarch64 + cuda13, but it ALSO transparently proxies the
# rest of PyPI through `/+f/<sha>/<filename>` URLs that 503 frequently.
# With `--extra-index-url` + `--index-strategy=unsafe-best-match` (the
# historical fix in install.sh) uv would pick those proxy URLs for ordinary
# PyPI packages — markdown-it-py, anthropic, propcache, etc. — and trip on
# the 503s. See e.g. CI run 25439791228 (markdown-it-py-4.0.0).
#
# `explicit = true` on the index makes uv consult the L4T mirror ONLY for
# packages mapped under [tool.uv.sources]. Everything else goes to PyPI.
# This breaks the historical 503 path without losing access to the L4T
# wheels we actually need from there. Mirrors the equivalent fix already
# in backend/python/vllm/pyproject.toml.
#
# `uv pip install -r requirements.txt` does NOT honor [tool.uv.sources]
# (sources are project-mode only, not pip-compat mode), so install.sh's
# l4t13 branch invokes `uv pip install --requirement pyproject.toml`
# directly. Other BUILD_PROFILEs continue to use the requirements-*.txt
# pipeline through libbackend.sh's installRequirements and never read
# this file.
[project]
name = "localai-sglang-l4t13"
version = "0.0.0"
requires-python = ">=3.12,<3.13"
dependencies = [
# Mirror of requirements.txt — kept in sync manually for now since the
# l4t13 path bypasses installRequirements (see install.sh).
"grpcio==1.80.0",
"protobuf",
"certifi",
"setuptools",
"pillow",
# L4T-specific accelerator stack (sourced from jetson-ai-lab below).
"torch",
"torchvision",
"torchaudio",
# sglang on jetson — the [all] extra is deliberately omitted because it
# pulls outlines/decord, and decord has no aarch64 cp312 wheel anywhere
# (PyPI has none; the jetson-ai-lab index ships only legacy cp35-cp37). With
# [all] uv backtracks through versions trying to satisfy decord and
# lands on sglang==0.1.16. The 0.5.0 floor matches the only major
# series the jetson-ai-lab sbsa/cu130 mirror currently publishes
# (sglang==0.5.1.post2 as of 2026-05-06). Bumping to >=0.5.11 here
# would make the build unsatisfiable until the mirror catches up.
# Gemma 4 / MTP recipes are therefore not supported on l4t13 — those
# features land on cublas12/cublas13 hosts that pull the newer wheel
# from PyPI. backend.py keeps backward compat with the 0.5.x SamplingParams
# field rename via runtime detection.
"sglang>=0.5.0",
# PyPI-resolvable packages that complete the runtime.
"accelerate",
"transformers",
]
[[tool.uv.index]]
name = "jetson-ai-lab"
url = "https://pypi.jetson-ai-lab.io/sbsa/cu130"
explicit = true
[tool.uv.sources]
torch = { index = "jetson-ai-lab" }
torchvision = { index = "jetson-ai-lab" }
torchaudio = { index = "jetson-ai-lab" }
sglang = { index = "jetson-ai-lab" }


@@ -1,3 +1,4 @@
# Bump this pin deliberately — sglang releases weekly and API surfaces
# (FunctionCallParser, ReasoningParser) move between releases.
sglang[all]>=0.4.0
# 0.5.11 is the floor for Gemma 4 support (PR sgl-project/sglang#21952).
sglang[all]>=0.5.11


@@ -1,5 +1,12 @@
# sglang 0.5.11 hard-pins torch==2.9.1. PyPI's default torch 2.9.1 wheel is
# now the cu130 build, which drags in cu130-flavoured sgl-kernel/sglang-kernel
# binaries that need libnvrtc.so.13 — incompatible with our cu12 host libs.
# Pin the cu128 PyTorch index so uv pulls cu12-flavoured torch (and the
# matching sgl-kernel cu12 wheels). install.sh adds --index-strategy=unsafe-best-match
# for cublas12 so uv consults this index alongside PyPI.
--extra-index-url https://download.pytorch.org/whl/cu128
accelerate
torch==2.7.1
torch==2.9.1
torchvision
torchaudio==2.7.1
torchaudio
transformers


@@ -0,0 +1,4 @@
# Bump this pin deliberately — sglang releases weekly and API surfaces
# (FunctionCallParser, ReasoningParser) move between releases.
# 0.5.11 is the floor for Gemma 4 support (PR sgl-project/sglang#21952).
sglang[all]>=0.5.11


@@ -0,0 +1,6 @@
--extra-index-url https://download.pytorch.org/whl/cu130
accelerate
torch
torchvision
torchaudio
transformers


@@ -1,12 +0,0 @@
--extra-index-url https://pypi.jetson-ai-lab.io/sbsa/cu130
accelerate
torch
torchvision
torchaudio
transformers
# Drop the [all] extra: it pulls outlines/decord, and decord has no
# aarch64 cp312 wheel anywhere (PyPI nor the jetson-ai-lab index ships
# only legacy cp35-cp37). With [all] uv backtracks through versions
# trying to satisfy decord and lands on sglang==0.1.16. Floor at 0.5.0
# so uv can't silently downgrade if a future resolution misfires.
sglang>=0.5.0


@@ -0,0 +1,101 @@
"""Unit tests for the sglang backend.
Helper-level tests run without launching the gRPC server or loading model
weights — they only exercise the pure-Python helpers on
``BackendServicer``. They do still require ``sglang`` to be importable
because ``_apply_engine_args`` validates keys against
``ServerArgs``'s dataclass fields.
"""
import unittest
class TestSglangHelpers(unittest.TestCase):
"""Tests for the pure helpers on BackendServicer (no gRPC, no engine)."""
def _servicer(self):
import sys
import os
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from backend import BackendServicer # noqa: E402
return BackendServicer()
def test_parse_options(self):
servicer = self._servicer()
opts = servicer._parse_options([
"tool_parser:hermes",
"reasoning_parser:deepseek_r1",
"invalid_no_colon",
"key_with_colons:a:b:c",
])
self.assertEqual(opts["tool_parser"], "hermes")
self.assertEqual(opts["reasoning_parser"], "deepseek_r1")
self.assertEqual(opts["key_with_colons"], "a:b:c")
self.assertNotIn("invalid_no_colon", opts)
def test_apply_engine_args_known_keys(self):
"""User-supplied JSON merges into the kwargs dict; pre-set typed
fields stay put when not overridden."""
import json as _json
servicer = self._servicer()
base = {
"model_path": "facebook/opt-125m",
"mem_fraction_static": 0.7,
}
extras = _json.dumps({
"trust_remote_code": True,
"speculative_algorithm": "EAGLE",
"speculative_num_steps": 1,
})
out = servicer._apply_engine_args(base, extras)
self.assertIs(out, base) # in-place merge — same dict back
self.assertTrue(out["trust_remote_code"])
self.assertEqual(out["speculative_algorithm"], "EAGLE")
self.assertEqual(out["speculative_num_steps"], 1)
self.assertEqual(out["model_path"], "facebook/opt-125m")
self.assertEqual(out["mem_fraction_static"], 0.7)
def test_apply_engine_args_engine_args_overrides_typed_fields(self):
"""engine_args wins over previously-set typed kwargs (vLLM precedence)."""
import json as _json
servicer = self._servicer()
base = {"model_path": "facebook/opt-125m", "mem_fraction_static": 0.7}
out = servicer._apply_engine_args(
base, _json.dumps({"mem_fraction_static": 0.5}),
)
self.assertEqual(out["mem_fraction_static"], 0.5)
def test_apply_engine_args_unknown_key_raises(self):
"""Typo'd key raises ValueError with a close-match suggestion."""
import json as _json
servicer = self._servicer()
base = {"model_path": "facebook/opt-125m"}
with self.assertRaises(ValueError) as ctx:
servicer._apply_engine_args(
base, _json.dumps({"trust_remotecode": True}),
)
msg = str(ctx.exception)
self.assertIn("trust_remotecode", msg)
self.assertIn("trust_remote_code", msg)
def test_apply_engine_args_empty_passthrough(self):
"""Empty / None engine_args returns the kwargs dict untouched."""
servicer = self._servicer()
base = {"model_path": "facebook/opt-125m"}
self.assertIs(servicer._apply_engine_args(base, ""), base)
self.assertIs(servicer._apply_engine_args(base, None), base)
def test_apply_engine_args_invalid_json_raises(self):
servicer = self._servicer()
with self.assertRaises(ValueError) as ctx:
servicer._apply_engine_args({}, "not-json")
self.assertIn("not valid JSON", str(ctx.exception))
def test_apply_engine_args_non_object_raises(self):
servicer = self._servicer()
with self.assertRaises(ValueError) as ctx:
servicer._apply_engine_args({}, "[1,2,3]")
self.assertIn("must be a JSON object", str(ctx.exception))
if __name__ == "__main__":
unittest.main()

backend/python/sglang/test.sh (new executable file, 12 lines)

@@ -0,0 +1,12 @@
#!/bin/bash
set -e
backend_dir=$(dirname $0)
if [ -d $backend_dir/common ]; then
source $backend_dir/common/libbackend.sh
else
source $backend_dir/../common/libbackend.sh
fi
runUnittests


@@ -721,6 +721,121 @@ GPU nodes. See [vLLM Multi-Node (Data-Parallel)]({{% relref
"features/distributed-mode#vllm-multi-node-data-parallel" %}})
for the head/follower configuration and a worked Kimi-K2.6 example.
### SGLang
[SGLang](https://github.com/sgl-project/sglang) is a fast serving
framework for LLMs and VLMs with a focus on prefix caching, speculative
decoding, and multi-modal generation. LocalAI ships a gRPC backend that
wraps SGLang's async `Engine`, including its native function-call and
reasoning parsers.
#### Setup
```yaml
name: sglang
backend: sglang
parameters:
model: "Qwen/Qwen3-4B"
template:
use_tokenizer_template: true
```
The backend will pull the model from HuggingFace on first load.
#### Passing arbitrary SGLang options with `engine_args`
The same `engine_args:` map that the vLLM backend accepts is also
honoured by the SGLang backend. Keys are validated against
[`ServerArgs`](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/server_args.py)
— SGLang's central configuration dataclass — and forwarded verbatim to
`Engine(**kwargs)`. Unknown keys fail at load time with the closest
valid name as a hint. Unlike vLLM, `ServerArgs` is flat: speculative
decoding fields are top-level (`speculative_algorithm`,
`speculative_draft_model_path`, etc.) rather than nested under a
`speculative_config:` dict.
The typed YAML fields shared with vLLM are mapped to their SGLang
equivalents (`gpu_memory_utilization` → `mem_fraction_static`,
`enforce_eager` → `disable_cuda_graph`, `tensor_parallel_size` →
`tp_size`, `max_model_len` → `context_length`). Anything else,
including all speculative-decoding flags, goes under `engine_args:`.
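For example (values are illustrative; `attention_backend` is just one of the many flat `ServerArgs` fields that can be set this way):

```yaml
name: qwen3-4b-tuned
backend: sglang
parameters:
  model: "Qwen/Qwen3-4B"
template:
  use_tokenizer_template: true
gpu_memory_utilization: 0.9     # typed field, mapped to mem_fraction_static
engine_args:
  mem_fraction_static: 0.8      # engine_args wins, so the engine sees 0.8
  attention_backend: flashinfer # any other ServerArgs field works the same way
```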
##### Speculative decoding: Gemma 4 with Multi-Token Prediction
Google publishes paired "assistant" drafters for every Gemma 4 size.
The drafters use Multi-Token Prediction (MTP) to propose several
candidate tokens per target step, which SGLang then verifies in
parallel. Flags below are transcribed verbatim from the
[SGLang Gemma 4 cookbook](https://docs.sglang.io/cookbook/autoregressive/Google/Gemma4#speculative-decoding-mtp-server-commands).
For consumer GPUs in the 16-24 GB range, use **E4B** (8 B total /
4 B effective parameters):
```yaml
name: gemma-4-e4b-mtp
backend: sglang
parameters:
model: google/gemma-4-E4B-it
context_size: 4096
template:
use_tokenizer_template: true
options:
- tool_parser:gemma4
- reasoning_parser:gemma4
engine_args:
mem_fraction_static: 0.85
speculative_algorithm: NEXTN
speculative_draft_model_path: google/gemma-4-E4B-it-assistant
speculative_num_steps: 5
speculative_num_draft_tokens: 6
speculative_eagle_topk: 1
```
For smaller cards (8-12 GB), drop to **E2B** (5 B total / 2 B effective)
by swapping the model paths to `google/gemma-4-E2B-it` and
`google/gemma-4-E2B-it-assistant`; the rest of the flags stay the same.
`NEXTN` is normalised to `EAGLE` inside `ServerArgs.__post_init__`, so
either value works — the cookbook uses `NEXTN`. `mem_fraction_static`
is the share of GPU memory SGLang reserves for the model + KV pool;
0.85 is the cookbook's default and adapts to whatever single GPU the
backend is running on.
The 31 B dense and 26 B-A4B MoE Gemma 4 variants exist in the same
cookbook but require `--tp-size 2`, so they're not in the gallery as
single-GPU recipes.
> **SGLang version requirement.** Gemma 4 support landed in SGLang via
> [PR #21952](https://github.com/sgl-project/sglang/pull/21952). The
> LocalAI sglang backend pins a release that includes it; if you've
> overridden the pin to an older version, this recipe will fail with a
> "model architecture not recognised" error at load time.
##### Other speculative algorithms
`speculative_algorithm:` also accepts `EAGLE`/`EAGLE3` (paired with an
EAGLE-style draft head), `DFLASH` (block-diffusion drafters from
[z-lab](https://huggingface.co/z-lab) for the Qwen3 family), `STANDALONE`
(a smaller draft LLM verifying a larger target), and `NGRAM` (no draft
model — pure prefix-history speculation). See SGLang's
[speculative-decoding docs](https://docs.sglang.io/advanced_features/speculative_decoding.html)
for the full algorithm matrix.
#### Tool calling and reasoning parsers
SGLang's native parsers stream `tool_calls` and `reasoning_content`
inside `ChatDelta` — the LocalAI Python backend wires them up
per-request rather than via `engine_args:`. Pick a parser by name:
```yaml
options:
- tool_parser:hermes
- reasoning_parser:deepseek_r1
```
The full list of registered parsers lives in `sglang.srt.function_call`
and `sglang.srt.parser.reasoning_parser`.
### Transformers
[Transformers](https://huggingface.co/docs/transformers/index) is a State-of-the-art Machine Learning library for PyTorch, TensorFlow, and JAX.


@@ -24108,6 +24108,73 @@
overrides:
parameters:
model: NousResearch/Hermes-3-Llama-3.1-405B
- &gemma4-sglang-mtp
name: "gemma-4-e2b-it:sglang-mtp"
url: "github:mudler/LocalAI/gallery/sglang-gemma-4-e2b-mtp.yaml@master"
icon: https://ai.google.dev/static/gemma/images/gemma3.png
license: gemma
urls:
- https://huggingface.co/google/gemma-4-E2B-it
- https://huggingface.co/google/gemma-4-E2B-it-assistant
- https://docs.sglang.io/cookbook/autoregressive/Google/Gemma4
description: |
Google Gemma 4 E2B-IT served by SGLang with Multi-Token Prediction
(MTP) speculative decoding. The companion drafter
google/gemma-4-E2B-it-assistant lets the target accept several
tokens per step. Flags are a 1:1 transcription of the SGLang
cookbook's MTP command (NEXTN algorithm, num_steps=5,
num_draft_tokens=6, eagle_topk=1, mem_fraction_static=0.85). The
E2B variant has 5B total / 2B effective parameters and targets the
smaller end of consumer GPUs.
tags:
- llm
- sglang
- gpu
- speculative-decoding
- mtp
- gemma
- gemma4
- gemma-4
- !!merge <<: *gemma4-sglang-mtp
name: "gemma-4-e4b-it:sglang-mtp"
url: "github:mudler/LocalAI/gallery/sglang-gemma-4-e4b-mtp.yaml@master"
urls:
- https://huggingface.co/google/gemma-4-E4B-it
- https://huggingface.co/google/gemma-4-E4B-it-assistant
- https://docs.sglang.io/cookbook/autoregressive/Google/Gemma4
description: |
Google Gemma 4 E4B-IT served by SGLang with Multi-Token Prediction
(MTP) speculative decoding. The companion drafter
google/gemma-4-E4B-it-assistant lets the target accept several
tokens per step. Flags are a 1:1 transcription of the SGLang
cookbook's MTP command (NEXTN algorithm, num_steps=5,
num_draft_tokens=6, eagle_topk=1, mem_fraction_static=0.85). The
E4B variant has 8B total / 4B effective parameters — the natural
    pick for consumer GPUs in the 16-24 GB range.
- name: "mimo-7b-mtp:sglang"
url: "github:mudler/LocalAI/gallery/sglang-mimo-7b-mtp.yaml@master"
icon: https://github.com/XiaomiMiMo/MiMo/raw/main/figures/Xiaomi_MiMo.png
license: mit
urls:
- https://huggingface.co/XiaomiMiMo/MiMo-7B-RL
- https://github.com/XiaomiMiMo/MiMo
description: |
Xiaomi MiMo-7B-RL served by SGLang with built-in Multi-Token
Prediction (MTP) heads (no separate drafter needed) plus online fp8
weight quantization to fit on a 16 GB consumer GPU. ~90% acceptance
per the model card. Verified end-to-end at ~88 tok/s on an RTX 5070
Ti (16 GB). Note: mem_fraction_static is dropped to 0.7 (vs sglang's
0.85 default) because the MTP draft worker's vocab embedding is
loaded unquantised (~1.2 GiB) and OOMs the static reservation
otherwise.
tags:
- llm
- sglang
- gpu
- speculative-decoding
- mtp
- reasoning
- fp8
- name: codellama-7b
url: github:mudler/LocalAI/gallery/codellama.yaml@master
urls:


@@ -0,0 +1,36 @@
---
name: "sglang-gemma-4-e2b-mtp"
config_file: |
backend: sglang
parameters:
model: google/gemma-4-E2B-it
max_tokens: 4096
context_size: 4096
function:
disable_no_action: true
grammar:
disable: true
parallel_calls: true
expect_strings_after_json: true
template:
use_tokenizer_template: true
options:
- tool_parser:gemma4
- reasoning_parser:gemma4
# Gemma 4 E2B-it served by SGLang with Multi-Token Prediction (MTP).
# Flags transcribed verbatim from the SGLang cookbook:
# https://docs.sglang.io/cookbook/autoregressive/Google/Gemma4#speculative-decoding-mtp-server-commands
# NEXTN is normalised to EAGLE inside ServerArgs.__post_init__.
# mem_fraction_static=0.85 adapts to the available GPU; E2B is the
# smaller variant of the Gemma 4 lineup and the natural fit for
  # consumer GPUs (notably 8-12 GB cards). Requires sglang built with
# PR #21952 (Gemma 4 model support); LocalAI's pinned release
# carries it.
engine_args:
mem_fraction_static: 0.85
speculative_algorithm: NEXTN
speculative_draft_model_path: google/gemma-4-E2B-it-assistant
speculative_num_steps: 5
speculative_num_draft_tokens: 6
speculative_eagle_topk: 1


@@ -0,0 +1,36 @@
---
name: "sglang-gemma-4-e4b-mtp"
config_file: |
backend: sglang
parameters:
model: google/gemma-4-E4B-it
max_tokens: 4096
context_size: 4096
function:
disable_no_action: true
grammar:
disable: true
parallel_calls: true
expect_strings_after_json: true
template:
use_tokenizer_template: true
options:
- tool_parser:gemma4
- reasoning_parser:gemma4
# Gemma 4 E4B-it served by SGLang with Multi-Token Prediction (MTP).
# Flags transcribed verbatim from the SGLang cookbook:
# https://docs.sglang.io/cookbook/autoregressive/Google/Gemma4#speculative-decoding-mtp-server-commands
# NEXTN is normalised to EAGLE inside ServerArgs.__post_init__.
# mem_fraction_static=0.85 adapts to the available GPU; E4B is the
# mid-size variant (8B total / 4B effective parameters) and targets
  # consumer GPUs in the 16-24 GB range. Requires sglang built with
# PR #21952 (Gemma 4 model support); LocalAI's pinned release
# carries it.
engine_args:
mem_fraction_static: 0.85
speculative_algorithm: NEXTN
speculative_draft_model_path: google/gemma-4-E4B-it-assistant
speculative_num_steps: 5
speculative_num_draft_tokens: 6
speculative_eagle_topk: 1


@@ -0,0 +1,34 @@
---
name: "sglang-mimo-7b-mtp"
config_file: |
backend: sglang
parameters:
model: XiaomiMiMo/MiMo-7B-RL
max_tokens: 4096
context_size: 4096
trust_remote_code: true
function:
disable_no_action: true
grammar:
disable: true
parallel_calls: true
expect_strings_after_json: true
template:
use_tokenizer_template: true
# Xiaomi MiMo-7B-RL with built-in Multi-Token Prediction (MTP) heads
# served via SGLang's EAGLE-aliased speculative-decoding path. ~90%
# acceptance rate per the model card. Quantised to fp8 at load time
# so the 7 B target fits on a 16 GB consumer GPU; mem_fraction_static
# is reduced from sglang's 0.85 default because the MTP draft worker
# loads its vocab embedding unquantised (bf16, ~1.2 GiB for MiMo's
# 152k vocab × 4096 hidden) and OOMs at 0.85. Verified end-to-end on
# an RTX 5070 Ti (16 GB) at ~88 tok/s.
engine_args:
dtype: bfloat16
quantization: fp8
mem_fraction_static: 0.7
speculative_algorithm: EAGLE
speculative_num_steps: 1
speculative_eagle_topk: 1
speculative_num_draft_tokens: 2

gallery/sglang.yaml (new file, 43 lines)

@@ -0,0 +1,43 @@
---
name: "sglang"
config_file: |
backend: sglang
context_size: 8192
parameters:
max_tokens: 8192
function:
disable_no_action: true
grammar:
disable: true
parallel_calls: true
expect_strings_after_json: true
template:
use_tokenizer_template: true
# Uncomment to specify a quantization method (optional)
# quantization: "fp8"
# Uncomment to set dtype: "auto", "half", "float16", "bfloat16", "float", "float32"
# dtype: "bfloat16"
# Uncomment to limit static GPU memory (sglang's mem_fraction_static — analogous to vLLM gpu_memory_utilization)
# gpu_memory_utilization: 0.75
# Uncomment to trust remote code from huggingface
# trust_remote_code: true
# Uncomment to disable CUDA graph capture (sglang's disable_cuda_graph)
# enforce_eager: true
# Uncomment to specify the maximum length of a sequence (sglang's context_length)
# max_model_len: 32768
# Uncomment and specify the number of Tensor divisions
# tensor_parallel_size: 2
#
# Anything ServerArgs exposes (~380 fields including speculative
# decoding, attention backend, MoE/EP, hierarchical cache, …) can be
# passed verbatim under engine_args:. See
# https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/server_args.py
# for the canonical list. Unknown keys fail at load time with a
# close-match suggestion.
# engine_args:
# speculative_algorithm: EAGLE3
# speculative_draft_model_path: lmsys/SGLang-EAGLE3-Llama-3.1-8B-Instruct-SpecForge
# speculative_num_steps: 3
# speculative_eagle_topk: 4
# speculative_num_draft_tokens: 16