feat(sglang): wire engine_args, add cuda13 build, ship MTP gallery demos (#9686)
Bring the sglang Python backend up to feature parity with vLLM by adding
the same engine_args: map plumbing the vLLM backend already has. Any
ServerArgs field (~380 in sglang 0.5.11) becomes settable from a model
YAML, including the speculative-decoding flags needed for Multi-Token
Prediction. Validation matches the vLLM backend's: keys are checked
against dataclasses.fields(ServerArgs), unknown keys raise ValueError
with a difflib close-match suggestion at LoadModel time, and the typed
ModelOptions fields keep their existing meaning, with engine_args
overriding them.
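For example, a model YAML can now reach any ServerArgs field directly
(a sketch; the model name is only a placeholder):

```yaml
backend: sglang
parameters:
  model: Qwen/Qwen3-4B
engine_args:
  speculative_algorithm: EAGLE
  speculative_num_draft_tokens: 2
```
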
Backend code:
* backend/python/sglang/backend.py: add _apply_engine_args, import
dataclasses/difflib/ServerArgs, call from LoadModel; rename Seed ->
sampling_seed (sglang 0.5.11 renamed the SamplingParams field).
* backend/python/sglang/test.py + test.sh + Makefile: six unit tests
exercising the helper directly (no engine load required).
Build / CI / backend gallery (cuda13 + l4t13 paths are now first-class):
* backend/python/sglang/install.sh: add --prerelease=allow because
sglang 0.5.11 hard-pins flash-attn-4, which only ships beta wheels;
add --index-strategy=unsafe-best-match for cublas12 so the cu128
torch index wins over default PyPI's cu130; new pyproject.toml-driven
l4t13 install path so [tool.uv.sources] can pin torch/torchvision/
torchaudio/sglang to the jetson-ai-lab index without forcing every
transitive PyPI dep through the L4T mirror's flaky proxy (mirrors the
equivalent fix in backend/python/vllm/install.sh).
* backend/python/sglang/pyproject.toml (new): L4T project spec with
explicit-source jetson-ai-lab index. Replaces requirements-l4t13.txt
for the l4t13 BUILD_PROFILE; other profiles still go through the
requirements-*.txt pipeline via libbackend.sh's installRequirements.
* backend/python/sglang/requirements-l4t13.txt: removed; superseded
by pyproject.toml.
* backend/python/sglang/requirements-cublas{12,13}{,-after}.txt: pin
sglang>=0.5.11 (the Gemma 4 floor); add the cu130 torch index for
cublas13 (new files) and the cu128 torch index for cublas12 (PyPI now
ships cu130 torch wheels by default, which breaks cu12 hosts).
* backend/index.yaml: add cuda13-sglang and cuda13-sglang-development
capability mappings + image entries pointing at
quay.io/.../-gpu-nvidia-cuda-13-sglang.
* .github/workflows/backend.yml: new cublas13 sglang matrix entry,
mirroring vllm's cuda13 build.
Model gallery + docs:
* gallery/sglang.yaml: base sglang config template, mirrors vllm.yaml.
* gallery/sglang-gemma-4-{e2b,e4b}-mtp.yaml: Gemma 4 MTP demos
transcribed verbatim from the SGLang Gemma 4 cookbook MTP commands.
* gallery/sglang-mimo-7b-mtp.yaml: MiMo-7B-RL with built-in MTP heads
+ online fp8 weight quantization, verified end-to-end on a 16 GB
RTX 5070 Ti at ~88 tok/s. Uses mem_fraction_static: 0.7 because the
MTP draft worker's vocab embedding is loaded unquantised and OOMs
the static reservation at sglang's 0.85 default.
* gallery/index.yaml: three new entries (gemma-4-e2b-it:sglang-mtp,
gemma-4-e4b-it:sglang-mtp, mimo-7b-mtp:sglang).
* docs/content/features/text-generation.md: new SGLang section with
setup, engine_args reference, MTP demos, version requirements.
* .agents/sglang-backend.md (new): agent one-pager covering the flat
ServerArgs structure, the typed-vs-engine_args precedence, the
speculative-decoding cheatsheet, and the mem_fraction_static gotcha
documented above.
* AGENTS.md: index entry for the new agent doc.
Known limitation: the two Gemma 4 MTP gallery entries ship a recipe
that doesn't yet run on stock libraries. The drafter checkpoints
(google/gemma-4-{E2B,E4B}-it-assistant) declare
model_type: gemma4_assistant / Gemma4AssistantForCausalLM, which
neither transformers (<=5.6.0, including the SGLang cookbook's pinned
commit 91b1ab1f... and main HEAD) nor sglang's own model registry
(<=0.5.11) registers as of 2026-05-06. They will start working when
HF or sglang upstream registers the architecture -- no LocalAI
changes needed. The MiMo MTP demo and the non-MTP Gemma 4 paths work
today on this build (verified on RTX 5070 Ti, 16 GB).
Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash] [WebFetch] [WebSearch]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
parent 048daa0cdc
commit c894d9c826
.agents/sglang-backend.md (new file, 62 lines)
@@ -0,0 +1,62 @@
# Working on the SGLang Backend

The SGLang backend lives at `backend/python/sglang/backend.py` (async gRPC). It wraps SGLang's `Engine` (`sglang.srt.entrypoints.engine.Engine`) and translates LocalAI's gRPC `PredictOptions` into SGLang sampling params + outputs into `Reply.chat_deltas`. Structurally it mirrors `backend/python/vllm/backend.py` — keep them shaped the same so changes in one have an obvious analog in the other.

## `engine_args` is the universal escape hatch

A small fixed set of fields on `ModelOptions` is mapped to typed SGLang kwargs in `LoadModel` (model, quantization, load_format, gpu_memory_utilization → mem_fraction_static, trust_remote_code, enforce_eager → disable_cuda_graph, tensor_parallel_size → tp_size, max_model_len → context_length, dtype). **Everything else** flows through the `engine_args:` YAML map.

Validation happens in `_apply_engine_args`. Keys are checked against `dataclasses.fields(ServerArgs)` (`sglang.srt.server_args.ServerArgs` is a flat `@dataclass` with ~380 fields). Unknown keys raise `ValueError` at LoadModel time with a `difflib.get_close_matches` suggestion — same shape as the vLLM backend.

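For example, a YAML with `engine_args: {trust_remotecode: true}` fails the load with `unknown engine_args key 'trust_remotecode'. did you mean 'trust_remote_code'?` (the exact message `_apply_engine_args` raises, exercised in `test.py`) instead of a `TypeError` deep inside engine startup.
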
**Precedence:** typed `ModelOptions` fields populate `engine_kwargs` first, then `engine_args` overrides them. So a YAML that sets both `gpu_memory_utilization: 0.9` and `engine_args.mem_fraction_static: 0.5` ends up at `0.5`. Document this when answering "why didn't my YAML field stick?".

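In YAML form (a sketch; the model name is only a placeholder):

```yaml
backend: sglang
parameters:
  model: Qwen/Qwen3-4B
gpu_memory_utilization: 0.9   # typed field, populates mem_fraction_static first
engine_args:
  mem_fraction_static: 0.5    # same ServerArgs field, applied last; this wins
```
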
**ServerArgs is flat.** Unlike vLLM, where speculative decoding is nested under `engine_args.speculative_config: {...}`, SGLang exposes flat top-level fields: `speculative_algorithm`, `speculative_draft_model_path`, `speculative_num_steps`, `speculative_eagle_topk`, `speculative_num_draft_tokens`, `speculative_dflash_block_size`, etc. There is no `speculative_config:` dict. Same goes for compilation, kv-transfer, attention — all flat.

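So a flat SGLang recipe looks like this (a sketch using the fields named above; the drafter path is a placeholder):

```yaml
engine_args:
  speculative_algorithm: EAGLE
  speculative_draft_model_path: some-org/some-drafter  # placeholder
  speculative_num_draft_tokens: 2
```

whereas the vLLM equivalent would nest these under `engine_args.speculative_config:`.
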
The canonical reference is `python/sglang/srt/server_args.py:ServerArgs` (line ~304). When SGLang adds new flags, no LocalAI code change is needed — they're automatically available via `engine_args:`. The validator picks them up because it introspects the live dataclass.

## Speculative decoding cheatsheet

`--speculative-algorithm` accepts `EAGLE`, `EAGLE3`, `NEXTN`, `STANDALONE`, `NGRAM`, `DFLASH`. `NEXTN` is silently rewritten to `EAGLE` in `ServerArgs.__post_init__` (`server_args.py:3286-3287`). MTP (Multi-Token Prediction) is the same EAGLE path with `num_steps=1, eagle_topk=1, num_draft_tokens=2` against a target whose architecture has multi-token heads (e.g. MiMo-7B-RL, DeepSeek-V3-MTP).

| Algorithm | Drafter requirement | Gallery demo target | Gallery demo drafter |
|-----------|--------------------|---------------------|----------------------|
| `NEXTN` / `EAGLE` (MTP) | Assistant drafter or built-in heads | google/gemma-4-E2B-it, google/gemma-4-E4B-it | google/gemma-4-E2B-it-assistant, google/gemma-4-E4B-it-assistant |
| `EAGLE3` | EAGLE3 draft head | (no gallery entry yet) | e.g. jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B |
| `DFLASH` | Block-diffusion drafter | (no gallery entry yet) | e.g. z-lab/Qwen3-4B-DFlash-b16 |
| `STANDALONE` | Smaller LLM as drafter | (no gallery entry yet) | any smaller chat-tuned LLM in the same family |
| `NGRAM` | None — uses prefix history | (no gallery entry yet) | n/a |

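For the built-in-heads MTP case, the parameters above translate to this `engine_args:` fragment (matching `gallery/sglang-mimo-7b-mtp.yaml` from this commit):

```yaml
engine_args:
  speculative_algorithm: EAGLE   # NEXTN works too; it normalises to EAGLE
  speculative_num_steps: 1
  speculative_eagle_topk: 1
  speculative_num_draft_tokens: 2
```
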
The Gemma 4 demos use `mem_fraction_static: 0.85` (cookbook default) and the cookbook's `num_steps=5, num_draft_tokens=6, eagle_topk=1` parameters. Other algorithms are reachable from any user YAML via `engine_args:` but don't have shipped demos yet — that's a deliberate gallery scope choice, not a backend limitation.

Gemma 4 support requires sglang built from a commit that includes [PR #21952](https://github.com/sgl-project/sglang/pull/21952). LocalAI's pinned release for cublas12 / cublas13 includes it. The `l4t13` (JetPack 7 / sbsa cu130) build floors at `sglang>=0.5.0` because the `pypi.jetson-ai-lab.io` mirror still ships only `0.5.1.post2` as of 2026-05-06 — Gemma 4 / MTP recipes are therefore not available on l4t13 until that mirror catches up. `backend.py` keeps backward compat with the 0.5.x → 0.5.11 `SamplingParams.seed` → `sampling_seed` rename via runtime detection.

Compatibility caveats per the SGLang docs: DFLASH and NGRAM are incompatible with `enable_dp_attention`; DFLASH requires `pp_size == 1`; STANDALONE is incompatible with `enable_dp_attention`; NGRAM is CUDA-only and disables the overlap scheduler.

### `mem_fraction_static` + quantization + MTP on consumer GPUs

When combining online weight quantization (`engine_args.quantization: fp8` / `awq` / etc.) with built-in-head MTP (`speculative_algorithm: EAGLE`/`NEXTN`) on a tight VRAM budget, sglang's default `mem_fraction_static: 0.85` will OOM during draft-worker init. The reason: sglang quantizes the **target** model's transformer blocks but loads the **MTP draft worker's vocab embedding** at the source dtype (typically bf16). For a 7 B-class model with a 150k-token vocab × 4096 hidden, that's another ~1.2 GiB allocated *after* the static pool is reserved. At 0.85 fraction on a 16 GB card there's no room left.

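As a rough check on that figure: 150,000 × 4096 × 2 bytes (bf16) ≈ 1.14 GiB, before allocator overhead.
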
Workaround: drop `mem_fraction_static` to ~0.7 so the post-static heap can absorb the MTP embedding alloc + CUDA graph private pools. Verified end-to-end on MiMo-7B-RL + fp8 + MTP on a 16 GB RTX 5070 Ti (`gallery/sglang-mimo-7b-mtp.yaml`) at ~88 tok/s. Models with larger vocabs or more MTP layers (e.g. DeepSeek-V3-MTP) need an even smaller fraction.

This isn't documented anywhere upstream as of 2026-05-06 — the SGLang Gemma 4 cookbook uses 0.85 because their MTP path doesn't go through `eagle_worker_v2.py` for an embedding-bearing draft module. Don't blanket-apply 0.7 across all sglang YAMLs; only when MTP-with-built-in-heads + quantization combine.

## Tool-call and reasoning parsers stay on `Options[]`

ServerArgs has `tool_call_parser` and `reasoning_parser` fields, and the backend does pass them through to `Engine` so SGLang's own HTTP/OAI surface keeps working. But for the **LocalAI** request path the backend constructs fresh per-request parser instances in `_make_parsers` (`backend.py:286`) because the parsers are stateful — the streaming and non-streaming paths each need their own.

So the user-facing knob stays on `Options[]`:

```yaml
options:
- tool_parser:hermes
- reasoning_parser:deepseek_r1
```

Putting these in `engine_args:` will set them on `ServerArgs` but the LocalAI-level streaming `ChatDelta` will not pick them up. Don't recommend that path.

## What's missing today (out of scope, but worth tracking)

- `core/config/hooks_sglang.go` — there is no SGLang equivalent of `hooks_vllm.go`. The vLLM hook auto-selects parsers for known model families from `parser_defaults.json` and seeds production engine_args defaults. A symmetric hook for SGLang could reuse the same `parser_defaults.json` (the SGLang parser names are different but the family detection is shared) and seed defaults like `enable_metrics: true` or attention-backend choices.
- `core/gallery/importers/sglang.go` — vLLM has an importer that resolves model architecture → parser defaults at gallery-import time. A matching importer for SGLang would let `local-ai install` populate sensible parsers automatically.

These should be a follow-up PR, not a blocker for the engine_args feature.

.github/workflows/backend.yml (13 lines changed)
@@ -959,6 +959,19 @@ jobs:
            dockerfile: "./backend/Dockerfile.python"
            context: "./"
            ubuntu-version: '2404'
          - build-type: 'cublas'
            cuda-major-version: "13"
            cuda-minor-version: "0"
            platforms: 'linux/amd64'
            tag-latest: 'auto'
            tag-suffix: '-gpu-nvidia-cuda-13-sglang'
            runs-on: 'arc-runner-set'
            base-image: "ubuntu:24.04"
            skip-drivers: 'false'
            backend: "sglang"
            dockerfile: "./backend/Dockerfile.python"
            context: "./"
            ubuntu-version: '2404'
          - build-type: 'cublas'
            cuda-major-version: "13"
            cuda-minor-version: "0"

AGENTS.md
@@ -24,6 +24,7 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
| [.agents/coding-style.md](.agents/coding-style.md) | Code style, editorconfig, logging, documentation conventions |
| [.agents/llama-cpp-backend.md](.agents/llama-cpp-backend.md) | Working on the llama.cpp backend — architecture, updating, tool call parsing |
| [.agents/vllm-backend.md](.agents/vllm-backend.md) | Working on the vLLM / vLLM-omni backends — native parsers, ChatDelta, CPU build, libnuma packaging, backend hooks |
| [.agents/sglang-backend.md](.agents/sglang-backend.md) | Working on the SGLang backend — `engine_args` validation against ServerArgs, speculative-decoding (EAGLE/EAGLE3/DFLASH/MTP) recipes, parser handling |
| [.agents/testing-mcp-apps.md](.agents/testing-mcp-apps.md) | Testing MCP Apps (interactive tool UIs) in the React UI |
| [.agents/api-endpoints-and-auth.md](.agents/api-endpoints-and-auth.md) | Adding API endpoints, auth middleware, feature permissions, user access control |
| [.agents/debugging-backends.md](.agents/debugging-backends.md) | Debugging runtime backend failures, dependency conflicts, rebuilding backends |

backend/index.yaml
@@ -287,6 +287,7 @@
      amd: "rocm-sglang"
      intel: "intel-sglang"
      nvidia-cuda-12: "cuda12-sglang"
      nvidia-cuda-13: "cuda13-sglang"
      nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-sglang"
      cpu: "cpu-sglang"
- &vllm-omni
@@ -1965,6 +1966,7 @@
      amd: "rocm-sglang-development"
      intel: "intel-sglang-development"
      nvidia-cuda-12: "cuda12-sglang-development"
      nvidia-cuda-13: "cuda13-sglang-development"
      nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-sglang-development"
      cpu: "cpu-sglang-development"
- !!merge <<: *sglang
@@ -1972,6 +1974,11 @@
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-sglang"
  mirrors:
    - localai/localai-backends:latest-gpu-nvidia-cuda-12-sglang
- !!merge <<: *sglang
  name: "cuda13-sglang"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-sglang"
  mirrors:
    - localai/localai-backends:latest-gpu-nvidia-cuda-13-sglang
- !!merge <<: *sglang
  name: "cuda13-nvidia-l4t-arm64-sglang"
  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-sglang"
@@ -1997,6 +2004,11 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-sglang"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-12-sglang
- !!merge <<: *sglang
  name: "cuda13-sglang-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-sglang"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-13-sglang
- !!merge <<: *sglang
  name: "cuda13-nvidia-l4t-arm64-sglang-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-sglang"

backend/python/sglang/Makefile
@@ -8,6 +8,12 @@ run: sglang
	bash run.sh
	@echo "sglang run."

.PHONY: test
test: sglang
	@echo "Testing sglang..."
	bash test.sh
	@echo "sglang tested."

.PHONY: protogen-clean
protogen-clean:
	$(RM) backend_pb2_grpc.py backend_pb2.py

backend/python/sglang/backend.py
@@ -9,10 +9,18 @@
The streaming path applies sglang's per-request FunctionCallParser and
ReasoningParser so tool_calls and reasoning_content are emitted
incrementally inside ChatDelta, which is a capability sglang exposes
natively and vLLM does not.

Like the vLLM backend, this one accepts an arbitrary ``engine_args:``
map in the model YAML; keys are validated against ``ServerArgs`` fields
and forwarded to ``Engine(**kwargs)``. That covers speculative decoding
(EAGLE/EAGLE3/DFLASH/NGRAM/STANDALONE plus MTP via NEXTN), attention
backend selection, MoE knobs, hierarchical cache, and so on.
"""
import asyncio
from concurrent import futures
import argparse
import dataclasses
import difflib
import signal
import sys
import os

@@ -38,6 +46,7 @@ from grpc_auth import get_auth_interceptors
# are wrapped in try/except so older / leaner installs that omit them
# still load the backend for plain text generation.
from sglang.srt.entrypoints.engine import Engine
from sglang.srt.server_args import ServerArgs

try:
    from sglang.srt.function_call.function_call_parser import FunctionCallParser

@@ -66,6 +75,19 @@ except Exception:
    HAS_TRANSFORMERS = False


# sglang 0.5.11 renamed SamplingParams.seed -> sampling_seed (PR #21952).
# Earlier 0.5.x releases (e.g. 0.5.1.post2 — the wheel still pinned by the
# pypi.jetson-ai-lab.io sbsa/cu130 mirror used by the l4t13 build profile)
# accept only `seed`. Detect the supported keyword once at import time so
# both versions work without a hard pin floor.
try:
    import inspect as _inspect
    from sglang.srt.sampling.sampling_params import SamplingParams as _SamplingParams
    _SEED_KEY = "sampling_seed" if "sampling_seed" in _inspect.signature(_SamplingParams).parameters else "seed"
except Exception:
    _SEED_KEY = "sampling_seed"


_ONE_DAY_IN_SECONDS = 60 * 60 * 24
MAX_WORKERS = int(os.environ.get('PYTHON_GRPC_MAX_WORKERS', '1'))

@@ -82,6 +104,37 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
            opts[key.strip()] = value.strip()
        return opts

    def _apply_engine_args(self, engine_kwargs: dict, engine_args_json: str) -> dict:
        """Merge user-supplied engine_args (JSON object) into the kwargs dict
        that will be forwarded to ``sglang.Engine`` (which constructs a
        ``ServerArgs`` from them).

        Mirrors ``backend/python/vllm/backend.py::_apply_engine_args`` but
        operates on the kwargs dict because sglang's ``Engine.__init__``
        accepts ``**kwargs`` directly rather than a pre-built dataclass.
        Validation happens against ``ServerArgs`` fields so a typo fails
        early with a close-match suggestion instead of producing a confusing
        ``TypeError`` deep inside engine startup.
        """
        if not engine_args_json:
            return engine_kwargs
        try:
            extra = json.loads(engine_args_json)
        except json.JSONDecodeError as e:
            raise ValueError(f"engine_args is not valid JSON: {e}") from e
        if not isinstance(extra, dict):
            raise ValueError(
                f"engine_args must be a JSON object, got {type(extra).__name__}"
            )
        valid = {f.name for f in dataclasses.fields(ServerArgs)}
        for key in extra:
            if key not in valid:
                suggestion = difflib.get_close_matches(key, valid, n=1)
                hint = f" did you mean {suggestion[0]!r}?" if suggestion else ""
                raise ValueError(f"unknown engine_args key {key!r}.{hint}")
        engine_kwargs.update(extra)
        return engine_kwargs

    def _messages_to_dicts(self, messages) -> List[dict]:
        result: List[dict] = []
        for msg in messages:

@@ -137,6 +190,16 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
        if self.reasoning_parser_name:
            engine_kwargs["reasoning_parser"] = self.reasoning_parser_name

        # engine_args from YAML overrides typed fields above so operators can
        # tune anything ServerArgs exposes (speculative decoding, attention
        # backend, MoE, hierarchical cache, …) without waiting on protobuf
        # changes.
        try:
            engine_kwargs = self._apply_engine_args(engine_kwargs, request.EngineArgs)
        except ValueError as err:
            print(f"engine_args error: {err}", file=sys.stderr)
            return backend_pb2.Result(success=False, message=str(err))

        try:
            self.llm = Engine(**engine_kwargs)
        except Exception as err:

@@ -221,7 +284,7 @@
    "TopP": "top_p",
    "TopK": "top_k",
    "MinP": "min_p",
-   "Seed": "seed",
+   "Seed": _SEED_KEY,
    "StopPrompts": "stop",
    "StopTokenIds": "stop_token_ids",
    "IgnoreEOS": "ignore_eos",

backend/python/sglang/install.sh
@@ -23,17 +23,32 @@ if [ "x${BUILD_PROFILE}" == "xcpu" ]; then
    EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi

# cublas12 needs a cu128 torch index (see requirements-cublas12.txt) — without
# unsafe-best-match uv falls through to default PyPI's cu130 torch wheel and
# the resulting sgl-kernel can't load on our cu12 host libs.
if [ "x${BUILD_PROFILE}" == "xcublas12" ]; then
    EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi

# sglang 0.5.11 (Gemma 4 support) declares flash-attn-4 as a hard dep, but
# upstream only publishes pre-release wheels (4.0.0b*). uv rejects
# pre-releases by default — opt in for sglang specifically. Drop this once
# flash-attn-4 4.0 stable lands.
EXTRA_PIP_INSTALL_FLAGS+=" --prerelease=allow"

# JetPack 7 / L4T arm64 wheels are built for cp312 and shipped via
# pypi.jetson-ai-lab.io. Bump the venv Python so the prebuilt sglang
-# wheel resolves cleanly. unsafe-best-match is required because the
-# jetson-ai-lab index lists transitive deps (e.g. decord) at older
-# versions only — without it uv refuses to fall through to PyPI for a
-# compatible wheel and resolution fails.
+# wheel resolves cleanly. The actual install on l4t13 goes through
+# pyproject.toml (see the elif branch below) so [tool.uv.sources] can
+# pin only torch/torchvision/torchaudio/sglang to the jetson-ai-lab
+# index — leaving PyPI as the path for transitive deps like
+# markdown-it-py / anthropic / propcache that the L4T mirror's proxy
+# 503s on. No --index-strategy flag here: the explicit index keeps the
+# scoping clean.
if [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
    PYTHON_VERSION="3.12"
    PYTHON_PATCH="12"
    PY_STANDALONE_TAG="20251120"
-   EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi

# sglang's CPU path has no prebuilt wheel on PyPI — upstream publishes

@@ -95,6 +110,27 @@ if [ "x${BUILD_TYPE}" == "x" ] || [ "x${FROM_SOURCE:-}" == "xtrue" ]; then
    fi
    uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} .
    popd
# L4T arm64 (JetPack 7): drive the install through pyproject.toml so that
# [tool.uv.sources] can pin torch/torchvision/torchaudio/sglang to the
# jetson-ai-lab index, while everything else (transitive deps and
# PyPI-resolvable packages like transformers / accelerate) comes from
# PyPI. Bypasses installRequirements because uv pip install -r
# requirements.txt does not honor sources — see
# backend/python/sglang/pyproject.toml for the rationale. Mirrors the
# equivalent path in backend/python/vllm/install.sh.
elif [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
    ensureVenv
    if [ "x${PORTABLE_PYTHON}" == "xtrue" ]; then
        export C_INCLUDE_PATH="${C_INCLUDE_PATH:-}:$(_portable_dir)/include/python${PYTHON_VERSION}"
    fi
    pushd "${backend_dir}"
    # Build deps first (matches installRequirements' requirements-install.txt
    # pass — sglang/sgl-kernel sdists need packaging/setuptools-scm in the
    # venv before they can build under --no-build-isolation).
    uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} -r requirements-install.txt
    uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} --requirement pyproject.toml
    popd
    runProtogen
else
    installRequirements
fi

backend/python/sglang/pyproject.toml (new file, 68 lines)
@@ -0,0 +1,68 @@
# L4T arm64 (JetPack 7 / sbsa cu130) install spec for the sglang backend.
#
# Why this file exists, and why only the l4t13 BUILD_PROFILE consumes it:
#
# pypi.jetson-ai-lab.io hosts the L4T-specific torch / sglang / sgl-kernel
# wheels we need on aarch64 + cuda13, but it ALSO transparently proxies the
# rest of PyPI through `/+f/<sha>/<filename>` URLs that 503 frequently.
# With `--extra-index-url` + `--index-strategy=unsafe-best-match` (the
# historical fix in install.sh) uv would pick those proxy URLs for ordinary
# PyPI packages — markdown-it-py, anthropic, propcache, etc. — and trip on
# the 503s. See e.g. CI run 25439791228 (markdown-it-py-4.0.0).
#
# `explicit = true` on the index makes uv consult the L4T mirror ONLY for
# packages mapped under [tool.uv.sources]. Everything else goes to PyPI.
# This breaks the historical 503 path without losing access to the L4T
# wheels we actually need from there. Mirrors the equivalent fix already
# in backend/python/vllm/pyproject.toml.
#
# `uv pip install -r requirements.txt` does NOT honor [tool.uv.sources]
# (sources are project-mode only, not pip-compat mode), so install.sh's
# l4t13 branch invokes `uv pip install --requirement pyproject.toml`
# directly. Other BUILD_PROFILEs continue to use the requirements-*.txt
# pipeline through libbackend.sh's installRequirements and never read
# this file.
[project]
name = "localai-sglang-l4t13"
version = "0.0.0"
requires-python = ">=3.12,<3.13"
dependencies = [
    # Mirror of requirements.txt — kept in sync manually for now since the
    # l4t13 path bypasses installRequirements (see install.sh).
    "grpcio==1.80.0",
    "protobuf",
    "certifi",
    "setuptools",
    "pillow",
    # L4T-specific accelerator stack (sourced from jetson-ai-lab below).
    "torch",
    "torchvision",
    "torchaudio",
    # sglang on jetson — the [all] extra is deliberately omitted because it
    # pulls outlines/decord, and decord has no aarch64 cp312 wheel anywhere
    # (neither PyPI nor the jetson-ai-lab index has one; only legacy
    # cp35-cp37 builds exist). With [all] uv backtracks through versions
    # trying to satisfy decord and lands on sglang==0.1.16. The 0.5.0 floor
    # matches the only major series the jetson-ai-lab sbsa/cu130 mirror
    # currently publishes (sglang==0.5.1.post2 as of 2026-05-06). Bumping
    # to >=0.5.11 here would make the build unsatisfiable until the mirror
    # catches up. Gemma 4 / MTP recipes are therefore not supported on
    # l4t13 — those features land on cublas12/cublas13 hosts that pull the
    # newer wheel from PyPI. backend.py keeps backward compat with the
    # 0.5.x SamplingParams field rename via runtime detection.
    "sglang>=0.5.0",
    # PyPI-resolvable packages that complete the runtime.
    "accelerate",
    "transformers",
]

[[tool.uv.index]]
name = "jetson-ai-lab"
url = "https://pypi.jetson-ai-lab.io/sbsa/cu130"
explicit = true

[tool.uv.sources]
torch = { index = "jetson-ai-lab" }
torchvision = { index = "jetson-ai-lab" }
torchaudio = { index = "jetson-ai-lab" }
sglang = { index = "jetson-ai-lab" }

backend/python/sglang/requirements-cublas12-after.txt
@@ -1,3 +1,4 @@
# Bump this pin deliberately — sglang releases weekly and API surfaces
# (FunctionCallParser, ReasoningParser) move between releases.
-sglang[all]>=0.4.0
+# 0.5.11 is the floor for Gemma 4 support (PR sgl-project/sglang#21952).
+sglang[all]>=0.5.11

backend/python/sglang/requirements-cublas12.txt
@@ -1,5 +1,12 @@
# sglang 0.5.11 hard-pins torch==2.9.1. PyPI's default torch 2.9.1 wheel is
# now the cu130 build, which drags in cu130-flavoured sgl-kernel/sglang-kernel
# binaries that need libnvrtc.so.13 — incompatible with our cu12 host libs.
# Pin the cu128 PyTorch index so uv pulls cu12-flavoured torch (and the
# matching sgl-kernel cu12 wheels). install.sh adds --index-strategy=unsafe-best-match
# for cublas12 so uv consults this index alongside PyPI.
--extra-index-url https://download.pytorch.org/whl/cu128
accelerate
-torch==2.7.1
+torch==2.9.1
torchvision
-torchaudio==2.7.1
+torchaudio
transformers

backend/python/sglang/requirements-cublas13-after.txt (new file, 4 lines)
@@ -0,0 +1,4 @@
# Bump this pin deliberately — sglang releases weekly and API surfaces
# (FunctionCallParser, ReasoningParser) move between releases.
# 0.5.11 is the floor for Gemma 4 support (PR sgl-project/sglang#21952).
sglang[all]>=0.5.11

backend/python/sglang/requirements-cublas13.txt (new file, 6 lines)
@@ -0,0 +1,6 @@
--extra-index-url https://download.pytorch.org/whl/cu130
accelerate
torch
torchvision
torchaudio
transformers

backend/python/sglang/requirements-l4t13.txt (removed; superseded by pyproject.toml)
@@ -1,12 +0,0 @@
--extra-index-url https://pypi.jetson-ai-lab.io/sbsa/cu130
accelerate
torch
torchvision
torchaudio
transformers
# Drop the [all] extra: it pulls outlines/decord, and decord has no
# aarch64 cp312 wheel anywhere (PyPI nor the jetson-ai-lab index ships
# only legacy cp35-cp37). With [all] uv backtracks through versions
# trying to satisfy decord and lands on sglang==0.1.16. Floor at 0.5.0
# so uv can't silently downgrade if a future resolution misfires.
sglang>=0.5.0

backend/python/sglang/test.py (new file, 101 lines)
@@ -0,0 +1,101 @@
"""Unit tests for the sglang backend.

Helper-level tests run without launching the gRPC server or loading model
weights — they only exercise the pure-Python helpers on
``BackendServicer``. They do still require ``sglang`` to be importable
because ``_apply_engine_args`` validates keys against
``ServerArgs``'s dataclass fields.
"""
import unittest


class TestSglangHelpers(unittest.TestCase):
    """Tests for the pure helpers on BackendServicer (no gRPC, no engine)."""

    def _servicer(self):
        import sys
        import os
        sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
        from backend import BackendServicer  # noqa: E402
        return BackendServicer()

    def test_parse_options(self):
        servicer = self._servicer()
        opts = servicer._parse_options([
            "tool_parser:hermes",
            "reasoning_parser:deepseek_r1",
            "invalid_no_colon",
            "key_with_colons:a:b:c",
        ])
        self.assertEqual(opts["tool_parser"], "hermes")
        self.assertEqual(opts["reasoning_parser"], "deepseek_r1")
        self.assertEqual(opts["key_with_colons"], "a:b:c")
        self.assertNotIn("invalid_no_colon", opts)

    def test_apply_engine_args_known_keys(self):
        """User-supplied JSON merges into the kwargs dict; pre-set typed
        fields stay put when not overridden."""
        import json as _json
        servicer = self._servicer()
        base = {
            "model_path": "facebook/opt-125m",
            "mem_fraction_static": 0.7,
        }
        extras = _json.dumps({
            "trust_remote_code": True,
            "speculative_algorithm": "EAGLE",
            "speculative_num_steps": 1,
        })
        out = servicer._apply_engine_args(base, extras)
        self.assertIs(out, base)  # in-place merge — same dict back
        self.assertTrue(out["trust_remote_code"])
        self.assertEqual(out["speculative_algorithm"], "EAGLE")
        self.assertEqual(out["speculative_num_steps"], 1)
        self.assertEqual(out["model_path"], "facebook/opt-125m")
        self.assertEqual(out["mem_fraction_static"], 0.7)

    def test_apply_engine_args_engine_args_overrides_typed_fields(self):
        """engine_args wins over previously-set typed kwargs (vLLM precedence)."""
        import json as _json
        servicer = self._servicer()
        base = {"model_path": "facebook/opt-125m", "mem_fraction_static": 0.7}
        out = servicer._apply_engine_args(
            base, _json.dumps({"mem_fraction_static": 0.5}),
        )
        self.assertEqual(out["mem_fraction_static"], 0.5)

    def test_apply_engine_args_unknown_key_raises(self):
        """Typo'd key raises ValueError with a close-match suggestion."""
        import json as _json
        servicer = self._servicer()
        base = {"model_path": "facebook/opt-125m"}
        with self.assertRaises(ValueError) as ctx:
            servicer._apply_engine_args(
                base, _json.dumps({"trust_remotecode": True}),
            )
        msg = str(ctx.exception)
        self.assertIn("trust_remotecode", msg)
        self.assertIn("trust_remote_code", msg)

    def test_apply_engine_args_empty_passthrough(self):
        """Empty / None engine_args returns the kwargs dict untouched."""
        servicer = self._servicer()
        base = {"model_path": "facebook/opt-125m"}
        self.assertIs(servicer._apply_engine_args(base, ""), base)
        self.assertIs(servicer._apply_engine_args(base, None), base)

    def test_apply_engine_args_invalid_json_raises(self):
        servicer = self._servicer()
        with self.assertRaises(ValueError) as ctx:
            servicer._apply_engine_args({}, "not-json")
        self.assertIn("not valid JSON", str(ctx.exception))

    def test_apply_engine_args_non_object_raises(self):
        servicer = self._servicer()
        with self.assertRaises(ValueError) as ctx:
            servicer._apply_engine_args({}, "[1,2,3]")
        self.assertIn("must be a JSON object", str(ctx.exception))


if __name__ == "__main__":
    unittest.main()

backend/python/sglang/test.sh (new executable file, 12 lines)
@@ -0,0 +1,12 @@
#!/bin/bash
set -e

backend_dir=$(dirname $0)

if [ -d $backend_dir/common ]; then
    source $backend_dir/common/libbackend.sh
else
    source $backend_dir/../common/libbackend.sh
fi

runUnittests

docs/content/features/text-generation.md
@@ -721,6 +721,121 @@
GPU nodes. See [vLLM Multi-Node (Data-Parallel)]({{% relref
"features/distributed-mode#vllm-multi-node-data-parallel" %}})
for the head/follower configuration and a worked Kimi-K2.6 example.

### SGLang

[SGLang](https://github.com/sgl-project/sglang) is a fast serving
framework for LLMs and VLMs with a focus on prefix caching, speculative
decoding, and multi-modal generation. LocalAI ships a gRPC backend that
wraps SGLang's async `Engine`, including its native function-call and
reasoning parsers.

#### Setup

```yaml
name: sglang
backend: sglang
parameters:
  model: "Qwen/Qwen3-4B"
template:
  use_tokenizer_template: true
```

The backend will pull the model from HuggingFace on first load.

#### Passing arbitrary SGLang options with `engine_args`

The same `engine_args:` map that the vLLM backend accepts is also
honoured by the SGLang backend. Keys are validated against
[`ServerArgs`](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/server_args.py)
— SGLang's central configuration dataclass — and forwarded verbatim to
`Engine(**kwargs)`. Unknown keys fail at load time with the closest
valid name as a hint. Unlike vLLM, `ServerArgs` is flat: speculative
decoding fields are top-level (`speculative_algorithm`,
`speculative_draft_model_path`, etc.) rather than nested under a
`speculative_config:` dict.

The typed YAML fields shared with vLLM are mapped to their SGLang
equivalents (`gpu_memory_utilization` → `mem_fraction_static`,
`enforce_eager` → `disable_cuda_graph`, `tensor_parallel_size` →
`tp_size`, `max_model_len` → `context_length`). Anything else,
including all speculative-decoding flags, goes under `engine_args:`.

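For instance (a sketch; the two `engine_args` keys shown are ordinary
`ServerArgs` fields, chosen here only as illustrations):

```yaml
name: sglang
backend: sglang
parameters:
  model: "Qwen/Qwen3-4B"
template:
  use_tokenizer_template: true
engine_args:
  attention_backend: flashinfer
  enable_metrics: true
```
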
##### Speculative decoding: Gemma 4 with Multi-Token Prediction

Google publishes paired "assistant" drafters for every Gemma 4 size.
The drafters use Multi-Token Prediction (MTP) to propose several
candidate tokens per target step, which SGLang then verifies in
parallel. Flags below are transcribed verbatim from the
[SGLang Gemma 4 cookbook](https://docs.sglang.io/cookbook/autoregressive/Google/Gemma4#speculative-decoding-mtp-server-commands).

For consumer GPUs in the 16–24 GB range, use **E4B** (8 B total /
4 B effective parameters):

```yaml
name: gemma-4-e4b-mtp
backend: sglang
parameters:
  model: google/gemma-4-E4B-it
context_size: 4096
template:
  use_tokenizer_template: true
options:
- tool_parser:gemma4
- reasoning_parser:gemma4
engine_args:
  mem_fraction_static: 0.85
  speculative_algorithm: NEXTN
  speculative_draft_model_path: google/gemma-4-E4B-it-assistant
  speculative_num_steps: 5
  speculative_num_draft_tokens: 6
  speculative_eagle_topk: 1
```

For smaller cards (8–12 GB), drop to **E2B** (5 B total / 2 B effective)
by swapping the model paths to `google/gemma-4-E2B-it` and
`google/gemma-4-E2B-it-assistant`; the rest of the flags stay the same.

`NEXTN` is normalised to `EAGLE` inside `ServerArgs.__post_init__`, so
either value works — the cookbook uses `NEXTN`. `mem_fraction_static`
is the share of GPU memory SGLang reserves for the model + KV pool;
0.85 is the cookbook's default and adapts to whatever single GPU the
backend is running on.

The 31 B dense and 26 B-A4B MoE Gemma 4 variants exist in the same
cookbook but require `--tp-size 2`, so they're not in the gallery as
single-GPU recipes.

> **SGLang version requirement.** Gemma 4 support landed in SGLang via
> [PR #21952](https://github.com/sgl-project/sglang/pull/21952). The
> LocalAI sglang backend pins a release that includes it; if you've
> overridden the pin to an older version, this recipe will fail with a
> "model architecture not recognised" error at load time.

##### Other speculative algorithms

`speculative_algorithm:` also accepts `EAGLE`/`EAGLE3` (paired with an
EAGLE-style draft head), `DFLASH` (block-diffusion drafters from
[z-lab](https://huggingface.co/z-lab) for the Qwen3 family), `STANDALONE`
(a smaller draft LLM verifying a larger target), and `NGRAM` (no draft
model — pure prefix-history speculation). See SGLang's
[speculative-decoding docs](https://docs.sglang.io/advanced_features/speculative_decoding.html)
for the full algorithm matrix.

#### Tool calling and reasoning parsers

SGLang's native parsers stream `tool_calls` and `reasoning_content`
inside `ChatDelta` — the LocalAI Python backend wires them up
per-request rather than via `engine_args:`. Pick a parser by name:

```yaml
options:
- tool_parser:hermes
- reasoning_parser:deepseek_r1
```

The full list of registered parsers lives in `sglang.srt.function_call`
and `sglang.srt.parser.reasoning_parser`.

### Transformers

[Transformers](https://huggingface.co/docs/transformers/index) is a State-of-the-art Machine Learning library for PyTorch, TensorFlow, and JAX.

gallery/index.yaml
@@ -24108,6 +24108,73 @@
    overrides:
      parameters:
        model: NousResearch/Hermes-3-Llama-3.1-405B
- &gemma4-sglang-mtp
  name: "gemma-4-e2b-it:sglang-mtp"
  url: "github:mudler/LocalAI/gallery/sglang-gemma-4-e2b-mtp.yaml@master"
  icon: https://ai.google.dev/static/gemma/images/gemma3.png
  license: gemma
  urls:
    - https://huggingface.co/google/gemma-4-E2B-it
    - https://huggingface.co/google/gemma-4-E2B-it-assistant
    - https://docs.sglang.io/cookbook/autoregressive/Google/Gemma4
  description: |
    Google Gemma 4 E2B-IT served by SGLang with Multi-Token Prediction
    (MTP) speculative decoding. The companion drafter
    google/gemma-4-E2B-it-assistant lets the target accept several
    tokens per step. Flags are a 1:1 transcription of the SGLang
    cookbook's MTP command (NEXTN algorithm, num_steps=5,
    num_draft_tokens=6, eagle_topk=1, mem_fraction_static=0.85). The
    E2B variant has 5B total / 2B effective parameters and targets the
    smaller end of consumer GPUs.
  tags:
    - llm
    - sglang
    - gpu
    - speculative-decoding
    - mtp
    - gemma
    - gemma4
    - gemma-4
- !!merge <<: *gemma4-sglang-mtp
  name: "gemma-4-e4b-it:sglang-mtp"
  url: "github:mudler/LocalAI/gallery/sglang-gemma-4-e4b-mtp.yaml@master"
  urls:
    - https://huggingface.co/google/gemma-4-E4B-it
    - https://huggingface.co/google/gemma-4-E4B-it-assistant
    - https://docs.sglang.io/cookbook/autoregressive/Google/Gemma4
  description: |
    Google Gemma 4 E4B-IT served by SGLang with Multi-Token Prediction
    (MTP) speculative decoding. The companion drafter
    google/gemma-4-E4B-it-assistant lets the target accept several
    tokens per step. Flags are a 1:1 transcription of the SGLang
    cookbook's MTP command (NEXTN algorithm, num_steps=5,
    num_draft_tokens=6, eagle_topk=1, mem_fraction_static=0.85). The
    E4B variant has 8B total / 4B effective parameters — the natural
    pick for consumer GPUs in the 16–24 GB range.
- name: "mimo-7b-mtp:sglang"
  url: "github:mudler/LocalAI/gallery/sglang-mimo-7b-mtp.yaml@master"
  icon: https://github.com/XiaomiMiMo/MiMo/raw/main/figures/Xiaomi_MiMo.png
  license: mit
  urls:
    - https://huggingface.co/XiaomiMiMo/MiMo-7B-RL
    - https://github.com/XiaomiMiMo/MiMo
  description: |
    Xiaomi MiMo-7B-RL served by SGLang with built-in Multi-Token
    Prediction (MTP) heads (no separate drafter needed) plus online fp8
    weight quantization to fit on a 16 GB consumer GPU. ~90% acceptance
    per the model card. Verified end-to-end at ~88 tok/s on an RTX 5070
    Ti (16 GB). Note: mem_fraction_static is dropped to 0.7 (vs sglang's
    0.85 default) because the MTP draft worker's vocab embedding is
    loaded unquantised (~1.2 GiB) and OOMs the static reservation
    otherwise.
  tags:
    - llm
    - sglang
    - gpu
    - speculative-decoding
    - mtp
    - reasoning
    - fp8
- name: codellama-7b
  url: github:mudler/LocalAI/gallery/codellama.yaml@master
  urls:

gallery/sglang-gemma-4-e2b-mtp.yaml (new file, 36 lines)
@@ -0,0 +1,36 @@
---
name: "sglang-gemma-4-e2b-mtp"

config_file: |
  backend: sglang
  parameters:
    model: google/gemma-4-E2B-it
    max_tokens: 4096
  context_size: 4096
  function:
    disable_no_action: true
    grammar:
      disable: true
      parallel_calls: true
      expect_strings_after_json: true
  template:
    use_tokenizer_template: true
  options:
  - tool_parser:gemma4
  - reasoning_parser:gemma4
  # Gemma 4 E2B-it served by SGLang with Multi-Token Prediction (MTP).
  # Flags transcribed verbatim from the SGLang cookbook:
  # https://docs.sglang.io/cookbook/autoregressive/Google/Gemma4#speculative-decoding-mtp-server-commands
  # NEXTN is normalised to EAGLE inside ServerArgs.__post_init__.
  # mem_fraction_static=0.85 adapts to the available GPU; E2B is the
  # smaller variant of the Gemma 4 lineup and the natural fit for
  # consumer GPUs (notably 8–12 GB cards). Requires sglang built with
  # PR #21952 (Gemma 4 model support); LocalAI's pinned release
  # carries it.
  engine_args:
    mem_fraction_static: 0.85
    speculative_algorithm: NEXTN
    speculative_draft_model_path: google/gemma-4-E2B-it-assistant
    speculative_num_steps: 5
    speculative_num_draft_tokens: 6
    speculative_eagle_topk: 1

gallery/sglang-gemma-4-e4b-mtp.yaml (new file, 36 lines)
@@ -0,0 +1,36 @@
---
name: "sglang-gemma-4-e4b-mtp"

config_file: |
  backend: sglang
  parameters:
    model: google/gemma-4-E4B-it
    max_tokens: 4096
  context_size: 4096
  function:
    disable_no_action: true
    grammar:
      disable: true
      parallel_calls: true
      expect_strings_after_json: true
  template:
    use_tokenizer_template: true
  options:
  - tool_parser:gemma4
  - reasoning_parser:gemma4
  # Gemma 4 E4B-it served by SGLang with Multi-Token Prediction (MTP).
  # Flags transcribed verbatim from the SGLang cookbook:
  # https://docs.sglang.io/cookbook/autoregressive/Google/Gemma4#speculative-decoding-mtp-server-commands
  # NEXTN is normalised to EAGLE inside ServerArgs.__post_init__.
  # mem_fraction_static=0.85 adapts to the available GPU; E4B is the
  # mid-size variant (8B total / 4B effective parameters) and targets
  # consumer GPUs in the 16–24 GB range. Requires sglang built with
  # PR #21952 (Gemma 4 model support); LocalAI's pinned release
  # carries it.
  engine_args:
    mem_fraction_static: 0.85
    speculative_algorithm: NEXTN
    speculative_draft_model_path: google/gemma-4-E4B-it-assistant
    speculative_num_steps: 5
    speculative_num_draft_tokens: 6
    speculative_eagle_topk: 1

gallery/sglang-mimo-7b-mtp.yaml (new file, 34 lines)
@@ -0,0 +1,34 @@
---
name: "sglang-mimo-7b-mtp"

config_file: |
  backend: sglang
  parameters:
    model: XiaomiMiMo/MiMo-7B-RL
    max_tokens: 4096
  context_size: 4096
  trust_remote_code: true
  function:
    disable_no_action: true
    grammar:
      disable: true
      parallel_calls: true
      expect_strings_after_json: true
  template:
    use_tokenizer_template: true
  # Xiaomi MiMo-7B-RL with built-in Multi-Token Prediction (MTP) heads
  # served via SGLang's EAGLE-aliased speculative-decoding path. ~90%
  # acceptance rate per the model card. Quantised to fp8 at load time
  # so the 7 B target fits on a 16 GB consumer GPU; mem_fraction_static
  # is reduced from sglang's 0.85 default because the MTP draft worker
  # loads its vocab embedding unquantised (bf16, ~1.2 GiB for MiMo's
  # 152k vocab × 4096 hidden) and OOMs at 0.85. Verified end-to-end on
  # an RTX 5070 Ti (16 GB) at ~88 tok/s.
  engine_args:
    dtype: bfloat16
    quantization: fp8
    mem_fraction_static: 0.7
    speculative_algorithm: EAGLE
    speculative_num_steps: 1
    speculative_eagle_topk: 1
    speculative_num_draft_tokens: 2

gallery/sglang.yaml (new file, 43 lines)
@@ -0,0 +1,43 @@
---
name: "sglang"

config_file: |
  backend: sglang
  context_size: 8192
  parameters:
    max_tokens: 8192
  function:
    disable_no_action: true
    grammar:
      disable: true
      parallel_calls: true
      expect_strings_after_json: true
  template:
    use_tokenizer_template: true
  # Uncomment to specify a quantization method (optional)
  # quantization: "fp8"
  # Uncomment to set dtype: "auto", "half", "float16", "bfloat16", "float", "float32"
  # dtype: "bfloat16"
  # Uncomment to limit static GPU memory (sglang's mem_fraction_static — analogous to vLLM gpu_memory_utilization)
  # gpu_memory_utilization: 0.75
  # Uncomment to trust remote code from huggingface
  # trust_remote_code: true
  # Uncomment to disable CUDA graph capture (sglang's disable_cuda_graph)
  # enforce_eager: true
  # Uncomment to specify the maximum length of a sequence (sglang's context_length)
  # max_model_len: 32768
  # Uncomment and specify the number of Tensor divisions
  # tensor_parallel_size: 2
  #
  # Anything ServerArgs exposes (~380 fields including speculative
  # decoding, attention backend, MoE/EP, hierarchical cache, …) can be
  # passed verbatim under engine_args:. See
  # https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/server_args.py
  # for the canonical list. Unknown keys fail at load time with a
  # close-match suggestion.
  # engine_args:
  #   speculative_algorithm: EAGLE3
  #   speculative_draft_model_path: lmsys/SGLang-EAGLE3-Llama-3.1-8B-Instruct-SpecForge
  #   speculative_num_steps: 3
  #   speculative_eagle_topk: 4
  #   speculative_num_draft_tokens: 16