mirror of
https://github.com/mudler/LocalAI.git
synced 2026-05-29 11:07:18 -04:00
1198d10b58ccbbb78dc9ecd93c8d1e6e5176dbd3
4 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
5cda4f1ccf |
fix(L4T13 backends): switch vllm/sglang/vllm-omni to PyPI aarch64+cu130 wheels (#9950)
* fix(vllm): switch L4T13 backend to PyPI aarch64+cu130 wheels The L4T13 vllm backend pulled torch / torchvision / torchaudio / vllm from pypi.jetson-ai-lab.io's sbsa/cu130 mirror via [tool.uv.sources] with no version pins. That mirror started shipping torch 2.11.0 next to a vllm-0.20.0+cu130 wheel that was still compiled against torch 2.10's c10 ABI, so uv landed on the mismatched pair and vllm crashed at import: ImportError: vllm/_C.abi3.so: undefined symbol: _ZN3c1013MessageLoggerC1EPKciib (c10::MessageLogger's constructor signature changed between torch 2.10 and 2.11; the vllm wheel referenced the 2.10 form, the installed libc10.so exported only the 2.11 form.) Since torch 2.11 (April 2026) PyPI publishes its own aarch64 + cu130 manylinux wheels, and vllm 0.20.0 ships an aarch64 wheel whose Requires- Dist locks torch==2.11.0 / torchvision==0.26.0 / torchaudio==2.11.0. That makes uv's resolver produce an ABI-consistent set automatically, so the mirror and the [tool.uv.sources] pinning are no longer needed. flash-attn is dropped from the dep list: PyPI has no aarch64 wheel, but vLLM 0.20+ already bundles its own vllm_flash_attn (fa2 + fa3) inside the main wheel, so the Dao-AILab package isn't required at runtime. Reference: https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/ Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] [WebFetch] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(vllm): retire l4t13 pyproject.toml in favor of requirements-*.txt pyproject.toml only existed because uv pip install -r requirements.txt doesn't honor [tool.uv.sources]. The previous commit dropped [tool.uv. sources] (PyPI now serves the aarch64 + cu130 wheels directly), so the file no longer carries any logic the requirements-*.txt path can't. Replace with the same two-file pattern every other build profile uses: - requirements-l4t13.txt (accelerate / torch / transformers / bitsandbytes - matches cublas13's split) - requirements-l4t13-after.txt (vllm; runs after the base resolve so the cu130 torch wheel lands first) install.sh's whole l4t13 elif branch goes away; libbackend.sh's installRequirements already handles the requirements-install.txt build- deps pass, the C_INCLUDE_PATH export for PORTABLE_PYTHON, and the runProtogen call, so falling through to the standard else: branch produces identical install behavior with less surface area. No functional change at install time - same wheels, same order. Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(sglang,vllm-omni): switch L4T13 backends to PyPI aarch64+cu130 wheels Same root cause and same fix as the vllm backend in the previous commits: the L4T13 sglang and vllm-omni backends both pulled their accelerator stack from pypi.jetson-ai-lab.io's sbsa/cu130 mirror with no version pins, so they would silently land on the same torch 2.11 vs cu130-built wheel ABI mismatch the moment the mirror published an out-of-sync pair. sglang ------ - Drop pyproject.toml + [tool.uv.sources]. The historical comment said the [all] extra was unsafe on aarch64 because of decord, but sglang 0.5.x now uses `decord2` on aarch64/arm/armv7l (which ships cp312 aarch64 wheels), so we can match cublas13's sglang[all]>=0.5.11 pin and stop being capped at the 0.5.1.post2 the L4T mirror shipped. That unblocks Gemma 4 / MTP recipes on Jetson Thor. - New requirements-l4t13.txt mirrors the cublas13 split (accelerate / torch / torchvision / torchaudio / transformers), requirements-l4t13- after.txt carries sglang[all]>=0.5.11. - install.sh's l4t13 elif branch goes away; falls through to the standard installRequirements path. vllm-omni --------- - requirements-l4t13.txt drops --extra-index-url to jetson-ai-lab and drops flash-attn (PyPI has no aarch64 wheel, vLLM 0.20+ bundles its own vllm_flash_attn fa2 + fa3 internally). - install.sh's l4t13 vllm-install branch collapses into the cublas13 branch since both now just run `pip install vllm --torch-backend=auto` against PyPI. - --index-strategy=unsafe-best-match is dropped from the top-level l4t13 guard; without the L4T mirror in the picture it had no purpose. The from-source vllm-omni install on top still keeps its existing `sed -i '/^fa3-fwd[[:space:]]*==/d' requirements/cuda.txt` workaround - fa3-fwd has no aarch64 wheel and no sdist, unrelated to flash-attn. Reference: https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/ Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] [WebFetch] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(sglang): drop [all] extra on l4t13 - xatlas has no aarch64 wheel CI revealed that sglang[all]==0.5.12 transitively pulls xatlas via the [diffusion] sub-extra, and xatlas ships no aarch64 wheel. Its sdist depends on scikit_build_core without declaring it in build-system. requires, so under --no-build-isolation uv can't build it from source: × Failed to build `xatlas==0.0.11` ├─▶ The build backend returned an error ╰─▶ Call to `scikit_build_core.build.build_wheel` failed (exit status: 1) ModuleNotFoundError: No module named 'scikit_build_core' help: `xatlas` (v0.0.11) was included because `sglang[all]` (v0.5.12) depends on `xatlas` Upstream sglang explicitly gates st_attn and vsa on `platform_machine != aarch64` inside the same [diffusion] extra but forgot xatlas - same class of bug that bit the old decord pin. Use plain `sglang>=0.5.11` on l4t13. backend.py imports only base sglang.srt symbols (Engine, ServerArgs, FunctionCallParser, ReasoningParser); the [all] extras are optional accelerators not required at import time. cublas13 (x86_64) keeps [all] because xatlas has x86_64 wheels there. Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
c894d9c826 |
feat(sglang): wire engine_args, add cuda13 build, ship MTP gallery demos (#9686)
Bring the sglang Python backend up to feature parity with vllm by adding
the same engine_args:-map plumbing the vLLM backend already has. Any
ServerArgs field (~380 in sglang 0.5.11) becomes settable from a model
YAML, including the speculative-decoding flags needed for Multi-Token
Prediction. Validation matches the vllm backend's: keys are checked
against dataclasses.fields(ServerArgs), unknown keys raise ValueError
with a difflib close-match suggestion at LoadModel time, and the typed
ModelOptions fields keep their existing meaning with engine_args
overriding them.
Backend code:
* backend/python/sglang/backend.py: add _apply_engine_args, import
dataclasses/difflib/ServerArgs, call from LoadModel; rename Seed ->
sampling_seed (sglang 0.5.11 renamed the SamplingParams field).
* backend/python/sglang/test.py + test.sh + Makefile: six unit tests
exercising the helper directly (no engine load required).
Build / CI / backend gallery (cuda13 + l4t13 paths are now first-class):
* backend/python/sglang/install.sh: add --prerelease=allow because
sglang 0.5.11 hard-pins flash-attn-4 which only ships beta wheels;
add --index-strategy=unsafe-best-match for cublas12 so the cu128
torch index wins over default-PyPI's cu130; new pyproject.toml-driven
l4t13 install path so [tool.uv.sources] can pin torch/torchvision/
torchaudio/sglang to the jetson-ai-lab index without forcing every
transitive PyPI dep through the L4T mirror's flaky proxy (mirrors the
equivalent fix in backend/python/vllm/install.sh).
* backend/python/sglang/pyproject.toml (new): L4T project spec with
explicit-source jetson-ai-lab index. Replaces requirements-l4t13.txt
for the l4t13 BUILD_PROFILE; other profiles still go through the
requirements-*.txt pipeline via libbackend.sh's installRequirements.
* backend/python/sglang/requirements-l4t13.txt: removed; superseded
by pyproject.toml.
* backend/python/sglang/requirements-cublas{12,13}{,-after}.txt: pin
sglang>=0.5.11 (Gemma 4 floor); add cu130 torch index for cublas13
(new files) and cu128 torch index for cublas12 (default PyPI now
ships cu130 torch wheels by default and breaks cu12 hosts).
* backend/index.yaml: add cuda13-sglang and cuda13-sglang-development
capability mappings + image entries pointing at
quay.io/.../-gpu-nvidia-cuda-13-sglang.
* .github/workflows/backend.yml: new cublas13 sglang matrix entry,
mirroring vllm's cuda13 build.
Model gallery + docs:
* gallery/sglang.yaml: base sglang config template, mirrors vllm.yaml.
* gallery/sglang-gemma-4-{e2b,e4b}-mtp.yaml: Gemma 4 MTP demos
transcribed verbatim from the SGLang Gemma 4 cookbook MTP commands.
* gallery/sglang-mimo-7b-mtp.yaml: MiMo-7B-RL with built-in MTP heads
+ online fp8 weight quantization, verified end-to-end on a 16 GB
RTX 5070 Ti at ~88 tok/s. Uses mem_fraction_static: 0.7 because the
MTP draft worker's vocab embedding is loaded unquantised and OOMs
the static reservation at sglang's 0.85 default.
* gallery/index.yaml: three new entries (gemma-4-e2b-it:sglang-mtp,
gemma-4-e4b-it:sglang-mtp, mimo-7b-mtp:sglang).
* docs/content/features/text-generation.md: new SGLang section with
setup, engine_args reference, MTP demos, version requirements.
* .agents/sglang-backend.md (new): agent one-pager covering the flat
ServerArgs structure, the typed-vs-engine_args precedence, the
speculative-decoding cheatsheet, and the mem_fraction_static gotcha
documented above.
* AGENTS.md: index entry for the new agent doc.
Known limitation: the two Gemma 4 MTP gallery entries ship a recipe
that doesn't yet run on stock libraries. The drafter checkpoints
(google/gemma-4-{E2B,E4B}-it-assistant) declare
model_type: gemma4_assistant / Gemma4AssistantForCausalLM, which
neither transformers (<=5.6.0, including the SGLang cookbook's pinned
commit 91b1ab1f... and main HEAD) nor sglang's own model registry
(<=0.5.11) registers as of 2026-05-06. They will start working when
HF or sglang upstream registers the architecture -- no LocalAI
changes needed. The MiMo MTP demo and the non-MTP Gemma 4 paths work
today on this build (verified on RTX 5070 Ti, 16 GB).
Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash] [WebFetch] [WebSearch]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
|
||
|
|
24505e57f5 |
feat(backends): add CUDA 13 + L4T arm64 CUDA 13 variants for vllm/vllm-omni/sglang (#9553)
* feat(backends): add CUDA 13 + L4T arm64 CUDA 13 variants for vllm/vllm-omni/sglang
Adds new build profiles mirroring the diffusers/ace-step pattern so vLLM
serving (and SGLang on arm64) can be deployed on CUDA 13 hosts and
JetPack 7 boards:
- vllm: cublas13 (PyPI cu130 channel) + l4t13 (jetson-ai-lab SBSA cu130
prebuilt vllm + flash-attn).
- vllm-omni: cublas13 + l4t13. Floats vllm version on cu13 since vllm
0.19+ ships cu130 wheels by default and vllm-omni tracks vllm master;
cu12 path keeps the 0.14.0 pin to avoid disturbing existing images.
- sglang: l4t13 arm64 only — uses the prebuilt sglang wheel from the
jetson-ai-lab SBSA cu130 index, so no source build is needed.
Cublas13 sglang on x86_64 is intentionally deferred.
CI matrix gains five new images (-gpu-nvidia-cuda-13-vllm{,-omni},
-nvidia-l4t-cuda-13-arm64-{vllm,vllm-omni,sglang}); backend/index.yaml
gains the matching capability keys (nvidia-cuda-13, nvidia-l4t-cuda-13)
and latest/development merge entries.
Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
* fix(backends): use unsafe-best-match index strategy on l4t13 builds
The jetson-ai-lab SBSA cu130 index lists transitive deps (decord, etc.)
at limited versions / older Python ABIs. uv defaults to the first index
that contains a package and refuses to fall through to PyPI, so sglang
l4t13 build fails resolving decord. Mirror the existing cpu sglang
profile by setting --index-strategy=unsafe-best-match on l4t13 across
the three backends, and apply it to the explicit vllm install line in
vllm-omni's install.sh (which doesn't honor EXTRA_PIP_INSTALL_FLAGS).
Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash]
* fix(sglang): drop [all] extras on l4t13, floor version at 0.5.0
The [all] extra brings in outlines→decord, and decord has no aarch64
cp312 wheel on PyPI nor the jetson-ai-lab index (only legacy cp35-cp37
tags). With unsafe-best-match enabled, uv backtracked through sglang
versions trying to satisfy decord and silently landed on
sglang==0.1.16, an ancient version with an entirely different dep
tree (cloudpickle/outlines 0.0.44, etc.).
Drop [all] so decord is no longer required, and floor sglang at 0.5.0
to prevent any future resolver misfire from degrading the version
again.
Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
b4e30692a2 |
feat(backends): add sglang (#9359)
* feat(backends): add sglang Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(sglang): force AVX-512 CXXFLAGS and disable CI e2e job sgl-kernel's shm.cpp uses __m512 AVX-512 intrinsics unconditionally; -march=native fails on CI runners without AVX-512 in /proc/cpuinfo. Force -march=sapphirerapids so the build always succeeds, matching sglang upstream's docker/xeon.Dockerfile recipe. The resulting binary still requires an AVX-512 capable CPU at runtime, so disable tests-sglang-grpc in test-extra.yml for the same reason tests-vllm-grpc is disabled. Local runs with make test-extra-backend-sglang still work on hosts with the right SIMD baseline. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(sglang): patch CMakeLists.txt instead of CXXFLAGS for AVX-512 CXXFLAGS with -march=sapphirerapids was being overridden by add_compile_options(-march=native) in sglang's CPU CMakeLists.txt, since CMake appends those flags after CXXFLAGS. Sed-patch the CMakeLists.txt directly after cloning to replace -march=native. --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |