feat(sglang): wire engine_args, add cuda13 build, ship MTP gallery demos (#9686)

Bring the sglang Python backend up to feature parity with vllm by adding
the same engine_args:-map plumbing the vLLM backend already has. Any
ServerArgs field (~380 in sglang 0.5.11) becomes settable from a model
YAML, including the speculative-decoding flags needed for Multi-Token
Prediction. Validation matches the vllm backend's: keys are checked
against dataclasses.fields(ServerArgs), unknown keys raise ValueError
with a difflib close-match suggestion at LoadModel time, and the typed
ModelOptions fields keep their existing meaning with engine_args
overriding them.
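
As an illustration of the new plumbing, a model YAML can now carry any
ServerArgs field under engine_args: (the model name below is only an
example; the flag values mirror the MiMo MTP demo added in this PR):

```yaml
name: my-sglang-model          # illustrative name
backend: sglang
parameters:
  model: XiaomiMiMo/MiMo-7B-RL
engine_args:
  quantization: fp8
  mem_fraction_static: 0.7
  speculative_algorithm: EAGLE
  speculative_num_steps: 1
  speculative_eagle_topk: 1
  speculative_num_draft_tokens: 2
```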

Backend code:
* backend/python/sglang/backend.py: add _apply_engine_args, import
  dataclasses/difflib/ServerArgs, call from LoadModel; rename Seed ->
  sampling_seed (sglang 0.5.11 renamed the SamplingParams field).
* backend/python/sglang/test.py + test.sh + Makefile: six unit tests
  exercising the helper directly (no engine load required).

Build / CI / backend gallery (cuda13 + l4t13 paths are now first-class):
* backend/python/sglang/install.sh: add --prerelease=allow because
  sglang 0.5.11 hard-pins flash-attn-4 which only ships beta wheels;
  add --index-strategy=unsafe-best-match for cublas12 so the cu128
  torch index wins over default-PyPI's cu130; new pyproject.toml-driven
  l4t13 install path so [tool.uv.sources] can pin torch/torchvision/
  torchaudio/sglang to the jetson-ai-lab index without forcing every
  transitive PyPI dep through the L4T mirror's flaky proxy (mirrors the
  equivalent fix in backend/python/vllm/install.sh).
* backend/python/sglang/pyproject.toml (new): L4T project spec with
  explicit-source jetson-ai-lab index. Replaces requirements-l4t13.txt
  for the l4t13 BUILD_PROFILE; other profiles still go through the
  requirements-*.txt pipeline via libbackend.sh's installRequirements.
* backend/python/sglang/requirements-l4t13.txt: removed; superseded
  by pyproject.toml.
* backend/python/sglang/requirements-cublas{12,13}{,-after}.txt: pin
  sglang>=0.5.11 (Gemma 4 floor); add cu130 torch index for cublas13
  (new files) and cu128 torch index for cublas12 (PyPI's default torch
  wheel is now the cu130 build, which breaks cu12 hosts).
* backend/index.yaml: add cuda13-sglang and cuda13-sglang-development
  capability mappings + image entries pointing at
  quay.io/.../-gpu-nvidia-cuda-13-sglang.
* .github/workflows/backend.yml: new cublas13 sglang matrix entry,
  mirroring vllm's cuda13 build.

Model gallery + docs:
* gallery/sglang.yaml: base sglang config template, mirrors vllm.yaml.
* gallery/sglang-gemma-4-{e2b,e4b}-mtp.yaml: Gemma 4 MTP demos
  transcribed verbatim from the SGLang Gemma 4 cookbook MTP commands.
* gallery/sglang-mimo-7b-mtp.yaml: MiMo-7B-RL with built-in MTP heads
  + online fp8 weight quantization, verified end-to-end on a 16 GB
  RTX 5070 Ti at ~88 tok/s. Uses mem_fraction_static: 0.7 because the
  MTP draft worker's vocab embedding is loaded unquantised and OOMs
  the static reservation at sglang's 0.85 default.
* gallery/index.yaml: three new entries (gemma-4-e2b-it:sglang-mtp,
  gemma-4-e4b-it:sglang-mtp, mimo-7b-mtp:sglang).
* docs/content/features/text-generation.md: new SGLang section with
  setup, engine_args reference, MTP demos, version requirements.
* .agents/sglang-backend.md (new): agent one-pager covering the flat
  ServerArgs structure, the typed-vs-engine_args precedence, the
  speculative-decoding cheatsheet, and the mem_fraction_static gotcha
  documented above.
* AGENTS.md: index entry for the new agent doc.

Known limitation: the two Gemma 4 MTP gallery entries ship a recipe
that doesn't yet run on stock libraries. The drafter checkpoints
(google/gemma-4-{E2B,E4B}-it-assistant) declare
model_type: gemma4_assistant / Gemma4AssistantForCausalLM, which
neither transformers (<=5.6.0, including the SGLang cookbook's pinned
commit 91b1ab1f... and main HEAD) nor sglang's own model registry
(<=0.5.11) registers as of 2026-05-06. They will start working when
HF or sglang upstream registers the architecture -- no LocalAI
changes needed. The MiMo MTP demo and the non-MTP Gemma 4 paths work
today on this build (verified on RTX 5070 Ti, 16 GB).

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash] [WebFetch] [WebSearch]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
Richard Palethorpe
2026-05-07 16:27:29 +01:00
committed by GitHub
parent 048daa0cdc
commit c894d9c826
21 changed files with 732 additions and 21 deletions

.agents/sglang-backend.md (new file, 62 lines)

@@ -0,0 +1,62 @@
# Working on the SGLang Backend
The SGLang backend lives at `backend/python/sglang/backend.py` (async gRPC). It wraps SGLang's `Engine` (`sglang.srt.entrypoints.engine.Engine`) and translates LocalAI's gRPC `PredictOptions` into SGLang sampling params + outputs into `Reply.chat_deltas`. Structurally it mirrors `backend/python/vllm/backend.py` — keep them shaped the same so changes in one have an obvious analog in the other.
## `engine_args` is the universal escape hatch
A small fixed set of fields on `ModelOptions` is mapped to typed SGLang kwargs in `LoadModel` (model, quantization, load_format, gpu_memory_utilization → mem_fraction_static, trust_remote_code, enforce_eager → disable_cuda_graph, tensor_parallel_size → tp_size, max_model_len → context_length, dtype). **Everything else** flows through the `engine_args:` YAML map.
Validation happens in `_apply_engine_args`. Keys are checked against `dataclasses.fields(ServerArgs)` (`sglang.srt.server_args.ServerArgs` is a flat `@dataclass` with ~380 fields). Unknown keys raise `ValueError` at LoadModel time with a `difflib.get_close_matches` suggestion — same shape as the vLLM backend.
**Precedence:** typed `ModelOptions` fields populate `engine_kwargs` first, then `engine_args` overrides them. So a YAML that sets both `gpu_memory_utilization: 0.9` and `engine_args.mem_fraction_static: 0.5` ends up at `0.5`. Document this when answering "why didn't my YAML field stick?".
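A minimal YAML sketch of that precedence (the model name is illustrative):

```yaml
backend: sglang
parameters:
  model: Qwen/Qwen3-4B
gpu_memory_utilization: 0.9   # typed field, seeds mem_fraction_static=0.9 first
engine_args:
  mem_fraction_static: 0.5    # engine_args overrides it, so the engine sees 0.5
```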
**ServerArgs is flat.** Unlike vLLM, where speculative decoding is nested under `engine_args.speculative_config: {...}`, SGLang exposes flat top-level fields: `speculative_algorithm`, `speculative_draft_model_path`, `speculative_num_steps`, `speculative_eagle_topk`, `speculative_num_draft_tokens`, `speculative_dflash_block_size`, etc. There is no `speculative_config:` dict. Same goes for compilation, kv-transfer, attention — all flat.
The canonical reference is `python/sglang/srt/server_args.py:ServerArgs` (line ~304). When SGLang adds new flags, no LocalAI code change is needed — they're automatically available via `engine_args:`. The validator picks them up because it introspects the live dataclass.
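Because of that flatness, a speculative-decoding block in a model YAML is just a set of top-level keys (sketch only; the drafter and step counts mirror the commented EAGLE3 example in `gallery/sglang.yaml` and are not a tuned recipe):

```yaml
engine_args:
  speculative_algorithm: EAGLE3
  speculative_draft_model_path: lmsys/SGLang-EAGLE3-Llama-3.1-8B-Instruct-SpecForge
  speculative_num_steps: 3
  speculative_eagle_topk: 4
  speculative_num_draft_tokens: 16
  # Note: there is no nested speculative_config: {...} key; that nesting is vLLM's shape, not SGLang's.
```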
## Speculative decoding cheatsheet
`--speculative-algorithm` accepts `EAGLE`, `EAGLE3`, `NEXTN`, `STANDALONE`, `NGRAM`, `DFLASH`. `NEXTN` is silently rewritten to `EAGLE` in `ServerArgs.__post_init__` (`server_args.py:3286-3287`). MTP (Multi-Token Prediction) is the same EAGLE path with `num_steps=1, eagle_topk=1, num_draft_tokens=2` against a target whose architecture has multi-token heads (e.g. MiMo-7B-RL, DeepSeek-V3-MTP).
| Algorithm | Drafter requirement | Gallery demo target | Gallery demo drafter |
|-----------|--------------------|---------------------|----------------------|
| `NEXTN` / `EAGLE` (MTP) | Assistant drafter or built-in heads | google/gemma-4-E2B-it, google/gemma-4-E4B-it | google/gemma-4-E2B-it-assistant, google/gemma-4-E4B-it-assistant |
| `EAGLE3` | EAGLE3 draft head | (no gallery entry yet) | e.g. jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B |
| `DFLASH` | Block-diffusion drafter | (no gallery entry yet) | e.g. z-lab/Qwen3-4B-DFlash-b16 |
| `STANDALONE` | Smaller LLM as drafter | (no gallery entry yet) | any smaller chat-tuned LLM in the same family |
| `NGRAM` | None — uses prefix history | (no gallery entry yet) | n/a |
The Gemma 4 demos use `mem_fraction_static: 0.85` (cookbook default) and the cookbook's `num_steps=5, num_draft_tokens=6, eagle_topk=1` parameters. Other algorithms are reachable from any user YAML via `engine_args:` but don't have shipped demos yet — that's a deliberate gallery scope choice, not a backend limitation.
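For the built-in-heads MTP shape, the whole speculative block reduces to a short `engine_args:` map; this sketch is the core of `gallery/sglang-mimo-7b-mtp.yaml`:

```yaml
engine_args:
  quantization: fp8             # optional online weight quantization
  mem_fraction_static: 0.7      # see the gotcha section below
  speculative_algorithm: EAGLE  # NEXTN would be rewritten to EAGLE anyway
  speculative_num_steps: 1
  speculative_eagle_topk: 1
  speculative_num_draft_tokens: 2
```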
Gemma 4 support requires sglang built from a commit that includes [PR #21952](https://github.com/sgl-project/sglang/pull/21952). LocalAI's pinned release for cublas12 / cublas13 includes it. The `l4t13` (JetPack 7 / sbsa cu130) build floors at `sglang>=0.5.0` because the `pypi.jetson-ai-lab.io` mirror still ships only `0.5.1.post2` as of 2026-05-06 — Gemma 4 / MTP recipes are therefore not available on l4t13 until that mirror catches up. `backend.py` keeps backward compat across the 0.5.x → 0.5.11 rename of `SamplingParams.seed` to `sampling_seed` via runtime detection.
Compatibility caveats per the SGLang docs: DFLASH and NGRAM are incompatible with `enable_dp_attention`; DFLASH requires `pp_size == 1`; STANDALONE is incompatible with `enable_dp_attention`; NGRAM is CUDA-only and disables the overlap scheduler.
### `mem_fraction_static` + quantization + MTP on consumer GPUs
When combining online weight quantization (`engine_args.quantization: fp8` / `awq` / etc.) with built-in-head MTP (`speculative_algorithm: EAGLE`/`NEXTN`) on a tight VRAM budget, sglang's default `mem_fraction_static: 0.85` will OOM during draft-worker init. The reason: sglang quantizes the **target** model's transformer blocks but loads the **MTP draft worker's vocab embedding** at the source dtype (typically bf16). For a 7 B-class model with a 150k-token vocab × 4096 hidden, that's another ~1.2 GiB allocated *after* the static pool is reserved. At 0.85 fraction on a 16 GB card there's no room left.
Workaround: drop `mem_fraction_static` to ~0.7 so the post-static heap can absorb the MTP embedding alloc + CUDA graph private pools. Verified end-to-end on MiMo-7B-RL + fp8 + MTP on a 16 GB RTX 5070 Ti (`gallery/sglang-mimo-7b-mtp.yaml`) at ~88 tok/s. Models with larger vocabs or more MTP layers (e.g. DeepSeek-V3-MTP) need an even smaller fraction.
This isn't documented anywhere upstream as of 2026-05-06 — the SGLang Gemma 4 cookbook uses 0.85 because their MTP path doesn't go through `eagle_worker_v2.py` for an embedding-bearing draft module. Don't blanket-apply 0.7 across all sglang YAMLs; only when MTP-with-built-in-heads + quantization combine.
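A quick back-of-envelope check of the embedding figure quoted above (sizes are the ones from the note; bf16 assumed at 2 bytes per parameter):

```python
# Rough size of the MTP draft worker's unquantised vocab embedding.
vocab_size = 150_000   # ~150k tokens (MiMo-7B-RL is ~152k)
hidden_size = 4096
bytes_per_param = 2    # bf16
gib = vocab_size * hidden_size * bytes_per_param / 2**30
print(f"~{gib:.2f} GiB")  # ~1.14 GiB, i.e. the "~1.2 GiB" allocated after the static pool is reserved
```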
## Tool-call and reasoning parsers stay on `Options[]`
ServerArgs has `tool_call_parser` and `reasoning_parser` fields, and the backend does pass them through to `Engine` so SGLang's own HTTP/OAI surface keeps working. But for the **LocalAI** request path the backend constructs fresh per-request parser instances in `_make_parsers` (`backend.py:286`) because the parsers are stateful — the streaming and non-streaming paths each need their own.
So the user-facing knob stays on `Options[]`:
```yaml
options:
- tool_parser:hermes
- reasoning_parser:deepseek_r1
```
Putting these in `engine_args:` will set them on `ServerArgs` but the LocalAI-level streaming `ChatDelta` will not pick them up. Don't recommend that path.
## What's missing today (out of scope, but worth tracking)
- `core/config/hooks_sglang.go` — there is no SGLang equivalent of `hooks_vllm.go`. The vLLM hook auto-selects parsers for known model families from `parser_defaults.json` and seeds production engine_args defaults. A symmetric hook for SGLang could reuse the same `parser_defaults.json` (the SGLang parser names are different but the family detection is shared) and seed defaults like `enable_metrics: true` or attention-backend choices.
- `core/gallery/importers/sglang.go` — vLLM has an importer that resolves model architecture → parser defaults at gallery-import time. A matching importer for SGLang would let `local-ai install` populate sensible parsers automatically.
These should be a follow-up PR, not a blocker for the engine_args feature.


@@ -959,6 +959,19 @@ jobs:
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13-sglang'
runs-on: 'arc-runner-set'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "sglang"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"


@@ -24,6 +24,7 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
| [.agents/coding-style.md](.agents/coding-style.md) | Code style, editorconfig, logging, documentation conventions |
| [.agents/llama-cpp-backend.md](.agents/llama-cpp-backend.md) | Working on the llama.cpp backend — architecture, updating, tool call parsing |
| [.agents/vllm-backend.md](.agents/vllm-backend.md) | Working on the vLLM / vLLM-omni backends — native parsers, ChatDelta, CPU build, libnuma packaging, backend hooks |
| [.agents/sglang-backend.md](.agents/sglang-backend.md) | Working on the SGLang backend — `engine_args` validation against ServerArgs, speculative-decoding (EAGLE/EAGLE3/DFLASH/MTP) recipes, parser handling |
| [.agents/testing-mcp-apps.md](.agents/testing-mcp-apps.md) | Testing MCP Apps (interactive tool UIs) in the React UI |
| [.agents/api-endpoints-and-auth.md](.agents/api-endpoints-and-auth.md) | Adding API endpoints, auth middleware, feature permissions, user access control |
| [.agents/debugging-backends.md](.agents/debugging-backends.md) | Debugging runtime backend failures, dependency conflicts, rebuilding backends |


@@ -287,6 +287,7 @@
amd: "rocm-sglang"
intel: "intel-sglang"
nvidia-cuda-12: "cuda12-sglang"
nvidia-cuda-13: "cuda13-sglang"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-sglang"
cpu: "cpu-sglang"
- &vllm-omni
@@ -1965,6 +1966,7 @@
amd: "rocm-sglang-development"
intel: "intel-sglang-development"
nvidia-cuda-12: "cuda12-sglang-development"
nvidia-cuda-13: "cuda13-sglang-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-sglang-development"
cpu: "cpu-sglang-development"
- !!merge <<: *sglang
@@ -1972,6 +1974,11 @@
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-sglang"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-sglang
- !!merge <<: *sglang
name: "cuda13-sglang"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-sglang"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-sglang
- !!merge <<: *sglang
name: "cuda13-nvidia-l4t-arm64-sglang"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-sglang"
@@ -1997,6 +2004,11 @@
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-sglang"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-sglang
- !!merge <<: *sglang
name: "cuda13-sglang-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-sglang"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-sglang
- !!merge <<: *sglang
name: "cuda13-nvidia-l4t-arm64-sglang-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-sglang"


@@ -8,6 +8,12 @@ run: sglang
bash run.sh
@echo "sglang run."
.PHONY: test
test: sglang
@echo "Testing sglang..."
bash test.sh
@echo "sglang tested."
.PHONY: protogen-clean
protogen-clean:
$(RM) backend_pb2_grpc.py backend_pb2.py


@@ -9,10 +9,18 @@ The streaming path applies sglang's per-request FunctionCallParser and
ReasoningParser so tool_calls and reasoning_content are emitted
incrementally inside ChatDelta, which is a capability sglang exposes
natively and vLLM does not.
Like the vLLM backend, this one accepts an arbitrary ``engine_args:``
map in the model YAML; keys are validated against ``ServerArgs`` fields
and forwarded to ``Engine(**kwargs)``. That covers speculative decoding
(EAGLE/EAGLE3/DFLASH/NGRAM/STANDALONE plus MTP via NEXTN), attention
backend selection, MoE knobs, hierarchical cache, and so on.
"""
import asyncio
from concurrent import futures
import argparse
import dataclasses
import difflib
import signal
import sys
import os
@@ -38,6 +46,7 @@ from grpc_auth import get_auth_interceptors
# are wrapped in try/except so older / leaner installs that omit them
# still load the backend for plain text generation.
from sglang.srt.entrypoints.engine import Engine
from sglang.srt.server_args import ServerArgs
try:
from sglang.srt.function_call.function_call_parser import FunctionCallParser
@@ -66,6 +75,19 @@ except Exception:
HAS_TRANSFORMERS = False
# sglang 0.5.11 renamed SamplingParams.seed -> sampling_seed (PR #21952).
# Earlier 0.5.x releases (e.g. 0.5.1.post2 — the wheel still pinned by the
# pypi.jetson-ai-lab.io sbsa/cu130 mirror used by the l4t13 build profile)
# accept only `seed`. Detect the supported keyword once at import time so
# both versions work without a hard pin floor.
try:
import inspect as _inspect
from sglang.srt.sampling.sampling_params import SamplingParams as _SamplingParams
_SEED_KEY = "sampling_seed" if "sampling_seed" in _inspect.signature(_SamplingParams).parameters else "seed"
except Exception:
_SEED_KEY = "sampling_seed"
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
MAX_WORKERS = int(os.environ.get('PYTHON_GRPC_MAX_WORKERS', '1'))
@@ -82,6 +104,37 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
opts[key.strip()] = value.strip()
return opts
def _apply_engine_args(self, engine_kwargs: dict, engine_args_json: str) -> dict:
"""Merge user-supplied engine_args (JSON object) into the kwargs dict
that will be forwarded to ``sglang.Engine`` (which constructs a
``ServerArgs`` from them).
Mirrors ``backend/python/vllm/backend.py::_apply_engine_args`` but
operates on the kwargs dict because sglang's ``Engine.__init__``
accepts ``**kwargs`` directly rather than a pre-built dataclass.
Validation happens against ``ServerArgs`` fields so a typo fails
early with a close-match suggestion instead of producing a confusing
``TypeError`` deep inside engine startup.
"""
if not engine_args_json:
return engine_kwargs
try:
extra = json.loads(engine_args_json)
except json.JSONDecodeError as e:
raise ValueError(f"engine_args is not valid JSON: {e}") from e
if not isinstance(extra, dict):
raise ValueError(
f"engine_args must be a JSON object, got {type(extra).__name__}"
)
valid = {f.name for f in dataclasses.fields(ServerArgs)}
for key in extra:
if key not in valid:
suggestion = difflib.get_close_matches(key, valid, n=1)
hint = f" did you mean {suggestion[0]!r}?" if suggestion else ""
raise ValueError(f"unknown engine_args key {key!r}.{hint}")
engine_kwargs.update(extra)
return engine_kwargs
def _messages_to_dicts(self, messages) -> List[dict]:
result: List[dict] = []
for msg in messages:
@@ -137,6 +190,16 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
if self.reasoning_parser_name:
engine_kwargs["reasoning_parser"] = self.reasoning_parser_name
# engine_args from YAML overrides typed fields above so operators can
# tune anything ServerArgs exposes (speculative decoding, attention
# backend, MoE, hierarchical cache, …) without waiting on protobuf
# changes.
try:
engine_kwargs = self._apply_engine_args(engine_kwargs, request.EngineArgs)
except ValueError as err:
print(f"engine_args error: {err}", file=sys.stderr)
return backend_pb2.Result(success=False, message=str(err))
try:
self.llm = Engine(**engine_kwargs)
except Exception as err:
@@ -221,7 +284,7 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
"TopP": "top_p",
"TopK": "top_k",
"MinP": "min_p",
"Seed": "seed",
"Seed": _SEED_KEY,
"StopPrompts": "stop",
"StopTokenIds": "stop_token_ids",
"IgnoreEOS": "ignore_eos",


@@ -23,17 +23,32 @@ if [ "x${BUILD_PROFILE}" == "xcpu" ]; then
EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi
# cublas12 needs a cu128 torch index (see requirements-cublas12.txt) — without
# unsafe-best-match uv falls through to default PyPI's cu130 torch wheel and
# the resulting sgl-kernel can't load on our cu12 host libs.
if [ "x${BUILD_PROFILE}" == "xcublas12" ]; then
EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi
# sglang 0.5.11 (Gemma 4 support) declares flash-attn-4 as a hard dep, but
# upstream only publishes pre-release wheels (4.0.0b*). uv rejects
# pre-releases by default — opt in for sglang specifically. Drop this once
# flash-attn-4 4.0 stable lands.
EXTRA_PIP_INSTALL_FLAGS+=" --prerelease=allow"
# JetPack 7 / L4T arm64 wheels are built for cp312 and shipped via
# pypi.jetson-ai-lab.io. Bump the venv Python so the prebuilt sglang
# wheel resolves cleanly. unsafe-best-match is required because the
# jetson-ai-lab index lists transitive deps (e.g. decord) at older
# versions only — without it uv refuses to fall through to PyPI for a
# compatible wheel and resolution fails.
# wheel resolves cleanly. The actual install on l4t13 goes through
# pyproject.toml (see the elif branch below) so [tool.uv.sources] can
# pin only torch/torchvision/torchaudio/sglang to the jetson-ai-lab
# index — leaving PyPI as the path for transitive deps like
# markdown-it-py / anthropic / propcache that the L4T mirror's proxy
# 503s on. No --index-strategy flag here: the explicit index keeps the
# scoping clean.
if [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
PYTHON_VERSION="3.12"
PYTHON_PATCH="12"
PY_STANDALONE_TAG="20251120"
EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi
# sglang's CPU path has no prebuilt wheel on PyPI — upstream publishes
@@ -95,6 +110,27 @@ if [ "x${BUILD_TYPE}" == "x" ] || [ "x${FROM_SOURCE:-}" == "xtrue" ]; then
fi
uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} .
popd
# L4T arm64 (JetPack 7): drive the install through pyproject.toml so that
# [tool.uv.sources] can pin torch/torchvision/torchaudio/sglang to the
# jetson-ai-lab index, while everything else (transitive deps and
# PyPI-resolvable packages like transformers / accelerate) comes from
# PyPI. Bypasses installRequirements because uv pip install -r
# requirements.txt does not honor sources — see
# backend/python/sglang/pyproject.toml for the rationale. Mirrors the
# equivalent path in backend/python/vllm/install.sh.
elif [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
ensureVenv
if [ "x${PORTABLE_PYTHON}" == "xtrue" ]; then
export C_INCLUDE_PATH="${C_INCLUDE_PATH:-}:$(_portable_dir)/include/python${PYTHON_VERSION}"
fi
pushd "${backend_dir}"
# Build deps first (matches installRequirements' requirements-install.txt
# pass — sglang/sgl-kernel sdists need packaging/setuptools-scm in the
# venv before they can build under --no-build-isolation).
uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} -r requirements-install.txt
uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} --requirement pyproject.toml
popd
runProtogen
else
installRequirements
fi


@@ -0,0 +1,68 @@
# L4T arm64 (JetPack 7 / sbsa cu130) install spec for the sglang backend.
#
# Why this file exists, and why only the l4t13 BUILD_PROFILE consumes it:
#
# pypi.jetson-ai-lab.io hosts the L4T-specific torch / sglang / sgl-kernel
# wheels we need on aarch64 + cuda13, but it ALSO transparently proxies the
# rest of PyPI through `/+f/<sha>/<filename>` URLs that 503 frequently.
# With `--extra-index-url` + `--index-strategy=unsafe-best-match` (the
# historical fix in install.sh) uv would pick those proxy URLs for ordinary
# PyPI packages — markdown-it-py, anthropic, propcache, etc. — and trip on
# the 503s. See e.g. CI run 25439791228 (markdown-it-py-4.0.0).
#
# `explicit = true` on the index makes uv consult the L4T mirror ONLY for
# packages mapped under [tool.uv.sources]. Everything else goes to PyPI.
# This breaks the historical 503 path without losing access to the L4T
# wheels we actually need from there. Mirrors the equivalent fix already
# in backend/python/vllm/pyproject.toml.
#
# `uv pip install -r requirements.txt` does NOT honor [tool.uv.sources]
# (sources are project-mode only, not pip-compat mode), so install.sh's
# l4t13 branch invokes `uv pip install --requirement pyproject.toml`
# directly. Other BUILD_PROFILEs continue to use the requirements-*.txt
# pipeline through libbackend.sh's installRequirements and never read
# this file.
[project]
name = "localai-sglang-l4t13"
version = "0.0.0"
requires-python = ">=3.12,<3.13"
dependencies = [
# Mirror of requirements.txt — kept in sync manually for now since the
# l4t13 path bypasses installRequirements (see install.sh).
"grpcio==1.80.0",
"protobuf",
"certifi",
"setuptools",
"pillow",
# L4T-specific accelerator stack (sourced from jetson-ai-lab below).
"torch",
"torchvision",
"torchaudio",
# sglang on jetson — the [all] extra is deliberately omitted because it
# pulls outlines/decord, and decord has no aarch64 cp312 wheel anywhere
# (PyPI has none; the jetson-ai-lab index ships only legacy cp35-cp37). With
# [all] uv backtracks through versions trying to satisfy decord and
# lands on sglang==0.1.16. The 0.5.0 floor matches the only major
# series the jetson-ai-lab sbsa/cu130 mirror currently publishes
# (sglang==0.5.1.post2 as of 2026-05-06). Bumping to >=0.5.11 here
# would make the build unsatisfiable until the mirror catches up.
# Gemma 4 / MTP recipes are therefore not supported on l4t13 — those
# features land on cublas12/cublas13 hosts that pull the newer wheel
# from PyPI. backend.py keeps backward compat with the 0.5.x SamplingParams
# field rename via runtime detection.
"sglang>=0.5.0",
# PyPI-resolvable packages that complete the runtime.
"accelerate",
"transformers",
]
[[tool.uv.index]]
name = "jetson-ai-lab"
url = "https://pypi.jetson-ai-lab.io/sbsa/cu130"
explicit = true
[tool.uv.sources]
torch = { index = "jetson-ai-lab" }
torchvision = { index = "jetson-ai-lab" }
torchaudio = { index = "jetson-ai-lab" }
sglang = { index = "jetson-ai-lab" }


@@ -1,3 +1,4 @@
# Bump this pin deliberately — sglang releases weekly and API surfaces
# (FunctionCallParser, ReasoningParser) move between releases.
sglang[all]>=0.4.0
# 0.5.11 is the floor for Gemma 4 support (PR sgl-project/sglang#21952).
sglang[all]>=0.5.11


@@ -1,5 +1,12 @@
# sglang 0.5.11 hard-pins torch==2.9.1. PyPI's default torch 2.9.1 wheel is
# now the cu130 build, which drags in cu130-flavoured sgl-kernel/sglang-kernel
# binaries that need libnvrtc.so.13 — incompatible with our cu12 host libs.
# Pin the cu128 PyTorch index so uv pulls cu12-flavoured torch (and the
# matching sgl-kernel cu12 wheels). install.sh adds --index-strategy=unsafe-best-match
# for cublas12 so uv consults this index alongside PyPI.
--extra-index-url https://download.pytorch.org/whl/cu128
accelerate
torch==2.7.1
torch==2.9.1
torchvision
torchaudio==2.7.1
torchaudio
transformers


@@ -0,0 +1,4 @@
# Bump this pin deliberately — sglang releases weekly and API surfaces
# (FunctionCallParser, ReasoningParser) move between releases.
# 0.5.11 is the floor for Gemma 4 support (PR sgl-project/sglang#21952).
sglang[all]>=0.5.11


@@ -0,0 +1,6 @@
--extra-index-url https://download.pytorch.org/whl/cu130
accelerate
torch
torchvision
torchaudio
transformers


@@ -1,12 +0,0 @@
--extra-index-url https://pypi.jetson-ai-lab.io/sbsa/cu130
accelerate
torch
torchvision
torchaudio
transformers
# Drop the [all] extra: it pulls outlines/decord, and decord has no
# aarch64 cp312 wheel anywhere (PyPI nor the jetson-ai-lab index ships
# only legacy cp35-cp37). With [all] uv backtracks through versions
# trying to satisfy decord and lands on sglang==0.1.16. Floor at 0.5.0
# so uv can't silently downgrade if a future resolution misfires.
sglang>=0.5.0


@@ -0,0 +1,101 @@
"""Unit tests for the sglang backend.
Helper-level tests run without launching the gRPC server or loading model
weights — they only exercise the pure-Python helpers on
``BackendServicer``. They do still require ``sglang`` to be importable
because ``_apply_engine_args`` validates keys against
``ServerArgs``'s dataclass fields.
"""
import unittest
class TestSglangHelpers(unittest.TestCase):
"""Tests for the pure helpers on BackendServicer (no gRPC, no engine)."""
def _servicer(self):
import sys
import os
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from backend import BackendServicer # noqa: E402
return BackendServicer()
def test_parse_options(self):
servicer = self._servicer()
opts = servicer._parse_options([
"tool_parser:hermes",
"reasoning_parser:deepseek_r1",
"invalid_no_colon",
"key_with_colons:a:b:c",
])
self.assertEqual(opts["tool_parser"], "hermes")
self.assertEqual(opts["reasoning_parser"], "deepseek_r1")
self.assertEqual(opts["key_with_colons"], "a:b:c")
self.assertNotIn("invalid_no_colon", opts)
def test_apply_engine_args_known_keys(self):
"""User-supplied JSON merges into the kwargs dict; pre-set typed
fields stay put when not overridden."""
import json as _json
servicer = self._servicer()
base = {
"model_path": "facebook/opt-125m",
"mem_fraction_static": 0.7,
}
extras = _json.dumps({
"trust_remote_code": True,
"speculative_algorithm": "EAGLE",
"speculative_num_steps": 1,
})
out = servicer._apply_engine_args(base, extras)
self.assertIs(out, base) # in-place merge — same dict back
self.assertTrue(out["trust_remote_code"])
self.assertEqual(out["speculative_algorithm"], "EAGLE")
self.assertEqual(out["speculative_num_steps"], 1)
self.assertEqual(out["model_path"], "facebook/opt-125m")
self.assertEqual(out["mem_fraction_static"], 0.7)
def test_apply_engine_args_engine_args_overrides_typed_fields(self):
"""engine_args wins over previously-set typed kwargs (vLLM precedence)."""
import json as _json
servicer = self._servicer()
base = {"model_path": "facebook/opt-125m", "mem_fraction_static": 0.7}
out = servicer._apply_engine_args(
base, _json.dumps({"mem_fraction_static": 0.5}),
)
self.assertEqual(out["mem_fraction_static"], 0.5)
def test_apply_engine_args_unknown_key_raises(self):
"""Typo'd key raises ValueError with a close-match suggestion."""
import json as _json
servicer = self._servicer()
base = {"model_path": "facebook/opt-125m"}
with self.assertRaises(ValueError) as ctx:
servicer._apply_engine_args(
base, _json.dumps({"trust_remotecode": True}),
)
msg = str(ctx.exception)
self.assertIn("trust_remotecode", msg)
self.assertIn("trust_remote_code", msg)
def test_apply_engine_args_empty_passthrough(self):
"""Empty / None engine_args returns the kwargs dict untouched."""
servicer = self._servicer()
base = {"model_path": "facebook/opt-125m"}
self.assertIs(servicer._apply_engine_args(base, ""), base)
self.assertIs(servicer._apply_engine_args(base, None), base)
def test_apply_engine_args_invalid_json_raises(self):
servicer = self._servicer()
with self.assertRaises(ValueError) as ctx:
servicer._apply_engine_args({}, "not-json")
self.assertIn("not valid JSON", str(ctx.exception))
def test_apply_engine_args_non_object_raises(self):
servicer = self._servicer()
with self.assertRaises(ValueError) as ctx:
servicer._apply_engine_args({}, "[1,2,3]")
self.assertIn("must be a JSON object", str(ctx.exception))
if __name__ == "__main__":
unittest.main()

backend/python/sglang/test.sh (new executable file, 12 lines)

@@ -0,0 +1,12 @@
#!/bin/bash
set -e
backend_dir=$(dirname $0)
if [ -d $backend_dir/common ]; then
source $backend_dir/common/libbackend.sh
else
source $backend_dir/../common/libbackend.sh
fi
runUnittests


@@ -721,6 +721,121 @@ GPU nodes. See [vLLM Multi-Node (Data-Parallel)]({{% relref
"features/distributed-mode#vllm-multi-node-data-parallel" %}})
for the head/follower configuration and a worked Kimi-K2.6 example.
### SGLang
[SGLang](https://github.com/sgl-project/sglang) is a fast serving
framework for LLMs and VLMs with a focus on prefix caching, speculative
decoding, and multi-modal generation. LocalAI ships a gRPC backend that
wraps SGLang's async `Engine`, including its native function-call and
reasoning parsers.
#### Setup
```yaml
name: sglang
backend: sglang
parameters:
model: "Qwen/Qwen3-4B"
template:
use_tokenizer_template: true
```
The backend will pull the model from HuggingFace on first load.
#### Passing arbitrary SGLang options with `engine_args`
The same `engine_args:` map that the vLLM backend accepts is also
honoured by the SGLang backend. Keys are validated against
[`ServerArgs`](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/server_args.py)
— SGLang's central configuration dataclass — and forwarded verbatim to
`Engine(**kwargs)`. Unknown keys fail at load time with the closest
valid name as a hint. Unlike vLLM, `ServerArgs` is flat: speculative
decoding fields are top-level (`speculative_algorithm`,
`speculative_draft_model_path`, etc.) rather than nested under a
`speculative_config:` dict.
The typed YAML fields shared with vLLM are mapped to their SGLang
equivalents (`gpu_memory_utilization` → `mem_fraction_static`,
`enforce_eager` → `disable_cuda_graph`, `tensor_parallel_size` →
`tp_size`, `max_model_len` → `context_length`). Anything else,
including all speculative-decoding flags, goes under `engine_args:`.
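For example (values are illustrative; `attention_backend` is just one of the many flat `ServerArgs` fields that can be set this way):

```yaml
name: qwen3-4b-tuned
backend: sglang
parameters:
  model: "Qwen/Qwen3-4B"
template:
  use_tokenizer_template: true
gpu_memory_utilization: 0.9     # typed field, mapped to mem_fraction_static
engine_args:
  mem_fraction_static: 0.8      # engine_args wins, so the engine sees 0.8
  attention_backend: flashinfer # any other ServerArgs field works the same way
```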
##### Speculative decoding: Gemma 4 with Multi-Token Prediction
Google publishes paired "assistant" drafters for every Gemma 4 size.
The drafters use Multi-Token Prediction (MTP) to propose several
candidate tokens per target step, which SGLang then verifies in
parallel. Flags below are transcribed verbatim from the
[SGLang Gemma 4 cookbook](https://docs.sglang.io/cookbook/autoregressive/Google/Gemma4#speculative-decoding-mtp-server-commands).
For consumer GPUs in the 16-24 GB range, use **E4B** (8 B total /
4 B effective parameters):
```yaml
name: gemma-4-e4b-mtp
backend: sglang
parameters:
model: google/gemma-4-E4B-it
context_size: 4096
template:
use_tokenizer_template: true
options:
- tool_parser:gemma4
- reasoning_parser:gemma4
engine_args:
mem_fraction_static: 0.85
speculative_algorithm: NEXTN
speculative_draft_model_path: google/gemma-4-E4B-it-assistant
speculative_num_steps: 5
speculative_num_draft_tokens: 6
speculative_eagle_topk: 1
```
For smaller cards (8-12 GB), drop to **E2B** (5 B total / 2 B effective)
by swapping the model paths to `google/gemma-4-E2B-it` and
`google/gemma-4-E2B-it-assistant`; the rest of the flags stay the same.
`NEXTN` is normalised to `EAGLE` inside `ServerArgs.__post_init__`, so
either value works — the cookbook uses `NEXTN`. `mem_fraction_static`
is the share of GPU memory SGLang reserves for the model + KV pool;
0.85 is the cookbook's default and adapts to whatever single GPU the
backend is running on.
The 31 B dense and 26 B-A4B MoE Gemma 4 variants exist in the same
cookbook but require `--tp-size 2`, so they're not in the gallery as
single-GPU recipes.
> **SGLang version requirement.** Gemma 4 support landed in SGLang via
> [PR #21952](https://github.com/sgl-project/sglang/pull/21952). The
> LocalAI sglang backend pins a release that includes it; if you've
> overridden the pin to an older version, this recipe will fail with a
> "model architecture not recognised" error at load time.
##### Other speculative algorithms
`speculative_algorithm:` also accepts `EAGLE`/`EAGLE3` (paired with an
EAGLE-style draft head), `DFLASH` (block-diffusion drafters from
[z-lab](https://huggingface.co/z-lab) for the Qwen3 family), `STANDALONE`
(a smaller draft LLM verifying a larger target), and `NGRAM` (no draft
model — pure prefix-history speculation). See SGLang's
[speculative-decoding docs](https://docs.sglang.io/advanced_features/speculative_decoding.html)
for the full algorithm matrix.
#### Tool calling and reasoning parsers
SGLang's native parsers stream `tool_calls` and `reasoning_content`
inside `ChatDelta` — the LocalAI Python backend wires them up
per-request rather than via `engine_args:`. Pick a parser by name:
```yaml
options:
- tool_parser:hermes
- reasoning_parser:deepseek_r1
```
The full list of registered parsers lives in `sglang.srt.function_call`
and `sglang.srt.parser.reasoning_parser`.
### Transformers
[Transformers](https://huggingface.co/docs/transformers/index) is a State-of-the-art Machine Learning library for PyTorch, TensorFlow, and JAX.


@@ -24108,6 +24108,73 @@
overrides:
parameters:
model: NousResearch/Hermes-3-Llama-3.1-405B
- &gemma4-sglang-mtp
name: "gemma-4-e2b-it:sglang-mtp"
url: "github:mudler/LocalAI/gallery/sglang-gemma-4-e2b-mtp.yaml@master"
icon: https://ai.google.dev/static/gemma/images/gemma3.png
license: gemma
urls:
- https://huggingface.co/google/gemma-4-E2B-it
- https://huggingface.co/google/gemma-4-E2B-it-assistant
- https://docs.sglang.io/cookbook/autoregressive/Google/Gemma4
description: |
Google Gemma 4 E2B-IT served by SGLang with Multi-Token Prediction
(MTP) speculative decoding. The companion drafter
google/gemma-4-E2B-it-assistant lets the target accept several
tokens per step. Flags are a 1:1 transcription of the SGLang
cookbook's MTP command (NEXTN algorithm, num_steps=5,
num_draft_tokens=6, eagle_topk=1, mem_fraction_static=0.85). The
E2B variant has 5B total / 2B effective parameters and targets the
smaller end of consumer GPUs.
tags:
- llm
- sglang
- gpu
- speculative-decoding
- mtp
- gemma
- gemma4
- gemma-4
- !!merge <<: *gemma4-sglang-mtp
name: "gemma-4-e4b-it:sglang-mtp"
url: "github:mudler/LocalAI/gallery/sglang-gemma-4-e4b-mtp.yaml@master"
urls:
- https://huggingface.co/google/gemma-4-E4B-it
- https://huggingface.co/google/gemma-4-E4B-it-assistant
- https://docs.sglang.io/cookbook/autoregressive/Google/Gemma4
description: |
Google Gemma 4 E4B-IT served by SGLang with Multi-Token Prediction
(MTP) speculative decoding. The companion drafter
google/gemma-4-E4B-it-assistant lets the target accept several
tokens per step. Flags are a 1:1 transcription of the SGLang
cookbook's MTP command (NEXTN algorithm, num_steps=5,
num_draft_tokens=6, eagle_topk=1, mem_fraction_static=0.85). The
E4B variant has 8B total / 4B effective parameters — the natural
    pick for consumer GPUs in the 16-24 GB range.
- name: "mimo-7b-mtp:sglang"
url: "github:mudler/LocalAI/gallery/sglang-mimo-7b-mtp.yaml@master"
icon: https://github.com/XiaomiMiMo/MiMo/raw/main/figures/Xiaomi_MiMo.png
license: mit
urls:
- https://huggingface.co/XiaomiMiMo/MiMo-7B-RL
- https://github.com/XiaomiMiMo/MiMo
description: |
Xiaomi MiMo-7B-RL served by SGLang with built-in Multi-Token
Prediction (MTP) heads (no separate drafter needed) plus online fp8
weight quantization to fit on a 16 GB consumer GPU. ~90% acceptance
per the model card. Verified end-to-end at ~88 tok/s on an RTX 5070
Ti (16 GB). Note: mem_fraction_static is dropped to 0.7 (vs sglang's
0.85 default) because the MTP draft worker's vocab embedding is
loaded unquantised (~1.2 GiB) and OOMs the static reservation
otherwise.
tags:
- llm
- sglang
- gpu
- speculative-decoding
- mtp
- reasoning
- fp8
- name: codellama-7b
url: github:mudler/LocalAI/gallery/codellama.yaml@master
urls:


@@ -0,0 +1,36 @@
---
name: "sglang-gemma-4-e2b-mtp"
config_file: |
backend: sglang
parameters:
model: google/gemma-4-E2B-it
max_tokens: 4096
context_size: 4096
function:
disable_no_action: true
grammar:
disable: true
parallel_calls: true
expect_strings_after_json: true
template:
use_tokenizer_template: true
options:
- tool_parser:gemma4
- reasoning_parser:gemma4
# Gemma 4 E2B-it served by SGLang with Multi-Token Prediction (MTP).
# Flags transcribed verbatim from the SGLang cookbook:
# https://docs.sglang.io/cookbook/autoregressive/Google/Gemma4#speculative-decoding-mtp-server-commands
# NEXTN is normalised to EAGLE inside ServerArgs.__post_init__.
# mem_fraction_static=0.85 adapts to the available GPU; E2B is the
# smaller variant of the Gemma 4 lineup and the natural fit for
  # consumer GPUs (notably 8-12 GB cards). Requires sglang built with
# PR #21952 (Gemma 4 model support); LocalAI's pinned release
# carries it.
engine_args:
mem_fraction_static: 0.85
speculative_algorithm: NEXTN
speculative_draft_model_path: google/gemma-4-E2B-it-assistant
speculative_num_steps: 5
speculative_num_draft_tokens: 6
speculative_eagle_topk: 1


@@ -0,0 +1,36 @@
---
name: "sglang-gemma-4-e4b-mtp"
config_file: |
backend: sglang
parameters:
model: google/gemma-4-E4B-it
max_tokens: 4096
context_size: 4096
function:
disable_no_action: true
grammar:
disable: true
parallel_calls: true
expect_strings_after_json: true
template:
use_tokenizer_template: true
options:
- tool_parser:gemma4
- reasoning_parser:gemma4
# Gemma 4 E4B-it served by SGLang with Multi-Token Prediction (MTP).
# Flags transcribed verbatim from the SGLang cookbook:
# https://docs.sglang.io/cookbook/autoregressive/Google/Gemma4#speculative-decoding-mtp-server-commands
# NEXTN is normalised to EAGLE inside ServerArgs.__post_init__.
# mem_fraction_static=0.85 adapts to the available GPU; E4B is the
# mid-size variant (8B total / 4B effective parameters) and targets
  # consumer GPUs in the 16-24 GB range. Requires sglang built with
# PR #21952 (Gemma 4 model support); LocalAI's pinned release
# carries it.
engine_args:
mem_fraction_static: 0.85
speculative_algorithm: NEXTN
speculative_draft_model_path: google/gemma-4-E4B-it-assistant
speculative_num_steps: 5
speculative_num_draft_tokens: 6
speculative_eagle_topk: 1


@@ -0,0 +1,34 @@
---
name: "sglang-mimo-7b-mtp"
config_file: |
backend: sglang
parameters:
model: XiaomiMiMo/MiMo-7B-RL
max_tokens: 4096
context_size: 4096
trust_remote_code: true
function:
disable_no_action: true
grammar:
disable: true
parallel_calls: true
expect_strings_after_json: true
template:
use_tokenizer_template: true
# Xiaomi MiMo-7B-RL with built-in Multi-Token Prediction (MTP) heads
# served via SGLang's EAGLE-aliased speculative-decoding path. ~90%
# acceptance rate per the model card. Quantised to fp8 at load time
# so the 7 B target fits on a 16 GB consumer GPU; mem_fraction_static
# is reduced from sglang's 0.85 default because the MTP draft worker
# loads its vocab embedding unquantised (bf16, ~1.2 GiB for MiMo's
# 152k vocab × 4096 hidden) and OOMs at 0.85. Verified end-to-end on
# an RTX 5070 Ti (16 GB) at ~88 tok/s.
engine_args:
dtype: bfloat16
quantization: fp8
mem_fraction_static: 0.7
speculative_algorithm: EAGLE
speculative_num_steps: 1
speculative_eagle_topk: 1
speculative_num_draft_tokens: 2

gallery/sglang.yaml (new file, 43 lines)

@@ -0,0 +1,43 @@
---
name: "sglang"
config_file: |
backend: sglang
context_size: 8192
parameters:
max_tokens: 8192
function:
disable_no_action: true
grammar:
disable: true
parallel_calls: true
expect_strings_after_json: true
template:
use_tokenizer_template: true
# Uncomment to specify a quantization method (optional)
# quantization: "fp8"
# Uncomment to set dtype: "auto", "half", "float16", "bfloat16", "float", "float32"
# dtype: "bfloat16"
# Uncomment to limit static GPU memory (sglang's mem_fraction_static — analogous to vLLM gpu_memory_utilization)
# gpu_memory_utilization: 0.75
# Uncomment to trust remote code from huggingface
# trust_remote_code: true
# Uncomment to disable CUDA graph capture (sglang's disable_cuda_graph)
# enforce_eager: true
# Uncomment to specify the maximum length of a sequence (sglang's context_length)
# max_model_len: 32768
# Uncomment and specify the number of Tensor divisions
# tensor_parallel_size: 2
#
# Anything ServerArgs exposes (~380 fields including speculative
# decoding, attention backend, MoE/EP, hierarchical cache, …) can be
# passed verbatim under engine_args:. See
# https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/server_args.py
# for the canonical list. Unknown keys fail at load time with a
# close-match suggestion.
# engine_args:
# speculative_algorithm: EAGLE3
# speculative_draft_model_path: lmsys/SGLang-EAGLE3-Llama-3.1-8B-Instruct-SpecForge
# speculative_num_steps: 3
# speculative_eagle_topk: 4
# speculative_num_draft_tokens: 16