LocalAI/AGENTS.md
Richard Palethorpe c894d9c826 feat(sglang): wire engine_args, add cuda13 build, ship MTP gallery demos (#9686)
Bring the sglang Python backend up to feature parity with vllm by adding
the same engine_args: map plumbing that the vLLM backend already has. Any
ServerArgs field (~380 in sglang 0.5.11) becomes settable from a model
YAML, including the speculative-decoding flags needed for Multi-Token
Prediction. Validation matches the vllm backend's: keys are checked
against dataclasses.fields(ServerArgs), unknown keys raise ValueError
with a difflib close-match suggestion at LoadModel time, and the typed
ModelOptions fields keep their existing meaning with engine_args
overriding them.
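
The validation path described above can be sketched roughly like this. This is a minimal illustration, not the backend's actual code: the ServerArgs stand-in below has three made-up-sized fields instead of sglang's ~380, and the helper name is only modeled on _apply_engine_args.

```python
import dataclasses
import difflib


# Stand-in for sglang's ServerArgs; the real dataclass is much larger.
@dataclasses.dataclass
class ServerArgs:
    model_path: str = ""
    mem_fraction_static: float = 0.85
    speculative_algorithm: str = ""


def apply_engine_args(server_args: ServerArgs, engine_args: dict) -> ServerArgs:
    """Validate YAML-supplied engine_args keys against ServerArgs fields,
    raising ValueError with a difflib close-match suggestion for unknown
    keys (as the backend does at LoadModel time)."""
    known = {f.name for f in dataclasses.fields(server_args)}
    for key, value in engine_args.items():
        if key not in known:
            hint = difflib.get_close_matches(key, known, n=1)
            suggestion = f" (did you mean '{hint[0]}'?)" if hint else ""
            raise ValueError(f"unknown engine_args key '{key}'{suggestion}")
        setattr(server_args, key, value)
    return server_args
```

Because the check runs at load time, a typo in a model YAML fails fast with a usable hint instead of being silently dropped.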

Backend code:
* backend/python/sglang/backend.py: add _apply_engine_args, import
  dataclasses/difflib/ServerArgs, call from LoadModel; rename Seed ->
  sampling_seed (sglang 0.5.11 renamed the SamplingParams field).
* backend/python/sglang/test.py + test.sh + Makefile: six unit tests
  exercising the helper directly (no engine load required).

Build / CI / backend gallery (cuda13 + l4t13 paths are now first-class):
* backend/python/sglang/install.sh: add --prerelease=allow because
  sglang 0.5.11 hard-pins flash-attn-4 which only ships beta wheels;
  add --index-strategy=unsafe-best-match for cublas12 so the cu128
  torch index wins over PyPI's default cu130; new pyproject.toml-driven
  l4t13 install path so [tool.uv.sources] can pin torch/torchvision/
  torchaudio/sglang to the jetson-ai-lab index without forcing every
  transitive PyPI dep through the L4T mirror's flaky proxy (mirrors the
  equivalent fix in backend/python/vllm/install.sh).
* backend/python/sglang/pyproject.toml (new): L4T project spec with
  explicit-source jetson-ai-lab index. Replaces requirements-l4t13.txt
  for the l4t13 BUILD_PROFILE; other profiles still go through the
  requirements-*.txt pipeline via libbackend.sh's installRequirements.
* backend/python/sglang/requirements-l4t13.txt: removed; superseded
  by pyproject.toml.
* backend/python/sglang/requirements-cublas{12,13}{,-after}.txt: pin
  sglang>=0.5.11 (Gemma 4 floor); add cu130 torch index for cublas13
  (new files) and cu128 torch index for cublas12 (PyPI now ships cu130
  torch wheels by default, which breaks cu12 hosts).
* backend/index.yaml: add cuda13-sglang and cuda13-sglang-development
  capability mappings + image entries pointing at
  quay.io/.../-gpu-nvidia-cuda-13-sglang.
* .github/workflows/backend.yml: new cublas13 sglang matrix entry,
  mirroring vllm's cuda13 build.

Model gallery + docs:
* gallery/sglang.yaml: base sglang config template, mirrors vllm.yaml.
* gallery/sglang-gemma-4-{e2b,e4b}-mtp.yaml: Gemma 4 MTP demos
  transcribed verbatim from the SGLang Gemma 4 cookbook MTP commands.
* gallery/sglang-mimo-7b-mtp.yaml: MiMo-7B-RL with built-in MTP heads
  + online fp8 weight quantization, verified end-to-end on a 16 GB
  RTX 5070 Ti at ~88 tok/s. Uses mem_fraction_static: 0.7 because the
  MTP draft worker's vocab embedding is loaded unquantised and OOMs
  the static reservation at sglang's 0.85 default.
* gallery/index.yaml: three new entries (gemma-4-e2b-it:sglang-mtp,
  gemma-4-e4b-it:sglang-mtp, mimo-7b-mtp:sglang).
* docs/content/features/text-generation.md: new SGLang section with
  setup, engine_args reference, MTP demos, version requirements.
* .agents/sglang-backend.md (new): agent one-pager covering the flat
  ServerArgs structure, the typed-vs-engine_args precedence, the
  speculative-decoding cheatsheet, and the mem_fraction_static gotcha
  documented above.
* AGENTS.md: index entry for the new agent doc.
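
The typed-vs-engine_args precedence that the gallery configs rely on can be sketched as a toy merge. The helper name and the example values are illustrative only, loosely modeled on the MiMo MTP demo; they are not the backend's actual implementation or the demo's exact settings.

```python
def merge_engine_args(typed_opts: dict, engine_args: dict) -> dict:
    """Typed ModelOptions populate the baseline; engine_args entries
    override them on key collisions, matching the precedence above."""
    merged = dict(typed_opts)
    merged.update(engine_args)
    return merged


# Illustrative values only: a typed baseline plus YAML engine_args that
# lower mem_fraction_static to leave headroom for the MTP draft worker.
typed = {"model_path": "XiaomiMiMo/MiMo-7B-RL", "mem_fraction_static": 0.85}
overrides = {"mem_fraction_static": 0.7}
```

The merge direction is the whole point: existing typed fields keep working unchanged, while any of ServerArgs' fields remains reachable from the YAML.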

Known limitation: the two Gemma 4 MTP gallery entries ship a recipe
that doesn't yet run on stock libraries. The drafter checkpoints
(google/gemma-4-{E2B,E4B}-it-assistant) declare
model_type: gemma4_assistant / Gemma4AssistantForCausalLM, which
neither transformers (<=5.6.0, including the SGLang cookbook's pinned
commit 91b1ab1f... and main HEAD) nor sglang's own model registry
(<=0.5.11) registers as of 2026-05-06. They will start working when
HF or sglang upstream registers the architecture -- no LocalAI
changes needed. The MiMo MTP demo and the non-MTP Gemma 4 paths work
today on this build (verified on RTX 5070 Ti, 16 GB).

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash] [WebFetch] [WebSearch]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-07 17:27:29 +02:00


LocalAI Agent Instructions

This file is the entry point for AI coding assistants (Claude Code, Cursor, Copilot, Codex, Aider, etc.) working on LocalAI. It is an index to detailed topic guides in the .agents/ directory. Read the relevant file(s) for the task at hand — you don't need to load all of them.

Human contributors: see CONTRIBUTING.md for the development workflow.

Policy for AI-Assisted Contributions

LocalAI follows the Linux kernel project's guidelines for AI coding assistants. Before submitting AI-assisted code, read .agents/ai-coding-assistants.md. Key rules:

  • No Signed-off-by from AI. Only the human submitter may sign off on the Developer Certificate of Origin.
  • No Co-Authored-By: <AI> trailers. The human contributor owns the change.
  • Use an Assisted-by: trailer to attribute AI involvement. Format: Assisted-by: AGENT_NAME:MODEL_VERSION [TOOL1] [TOOL2].
  • The human submitter is responsible for reviewing, testing, and understanding every line of generated code.

Topics

File When to read
.agents/ai-coding-assistants.md Policy for AI-assisted contributions — licensing, DCO, attribution
.agents/building-and-testing.md Building the project, running tests, Docker builds for specific platforms
.agents/ci-caching.md CI build cache layout (registry-backed BuildKit cache on quay.io/go-skynet/ci-cache), DEPS_REFRESH weekly cache-buster for unpinned Python deps, manual eviction
.agents/adding-backends.md Adding a new backend (Python, Go, or C++) — full step-by-step checklist, including importer integration (the /import-model dropdown is server-driven from GET /backends/known)
.agents/coding-style.md Code style, editorconfig, logging, documentation conventions
.agents/llama-cpp-backend.md Working on the llama.cpp backend — architecture, updating, tool call parsing
.agents/vllm-backend.md Working on the vLLM / vLLM-omni backends — native parsers, ChatDelta, CPU build, libnuma packaging, backend hooks
.agents/sglang-backend.md Working on the SGLang backend — engine_args validation against ServerArgs, speculative-decoding (EAGLE/EAGLE3/DFLASH/MTP) recipes, parser handling
.agents/testing-mcp-apps.md Testing MCP Apps (interactive tool UIs) in the React UI
.agents/api-endpoints-and-auth.md Adding API endpoints, auth middleware, feature permissions, user access control
.agents/debugging-backends.md Debugging runtime backend failures, dependency conflicts, rebuilding backends
.agents/adding-gallery-models.md Adding GGUF models from HuggingFace to the model gallery
.agents/localai-assistant-mcp.md LocalAI Assistant chat modality — adding admin tools to the in-process MCP server, editing skill prompts, keeping REST + MCP + skills in sync

Quick Reference

  • Logging: Use github.com/mudler/xlog (same API as slog)
  • Go style: Prefer any over interface{}
  • Comments: Explain why, not what
  • Docs: Update docs/content/ when adding features or changing config
  • New API endpoints: LocalAI advertises its capability surface in several independent places — swagger @Tags, /api/instructions registry, auth RouteFeatureRegistry, React UI capabilities.js, docs. Read .agents/api-endpoints-and-auth.md and follow its checklist — missing any surface means clients, admins, and the UI won't know the endpoint exists.
  • Admin endpoints → MCP tool: every admin endpoint that an admin would manage conversationally (install/list/edit/toggle/upgrade) MUST also be exposed as an MCP tool in pkg/mcp/localaitools/. The LocalAI Assistant chat modality and the standalone local-ai mcp-server consume that package; drift between REST and MCP is a real risk. Read .agents/localai-assistant-mcp.md — the TestToolHTTPRouteMappingComplete test fails until you wire the new tool and update the route map.
  • Build: Inspect Makefile and .github/workflows/ — ask the user before running long builds
  • UI: The active UI is the React app in core/http/react-ui/. The older Alpine.js/HTML UI in core/http/static/ is pending deprecation — all new UI work goes in the React UI