LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 16:19:07 -04:00

Files

Ettore Di Giacinto acb22a66ed feat(paged): mirror MoE token-tile density-aware auto-select (patch 0015)

Mirror of llama-paged-dev commit 151343b into the pinned paged patch series.
The durable, default-on follow-up to patch 0014's opt-in LLAMA_MOE_MMQ_X global
cap: a host-side density-aware mmq_x auto-select in mul_mat_q_case that caps the
MUL_MAT_ID grouped FP4-MMA token-tile only at low per-expert density (decode) and
keeps the 128 tile at high density (prefill), so it is prefill-safe by construction
(removes 0014's ~1.3% prefill cost). No new kernel.

density_max default = 8 (not tile/4 = 16): 16 equals the 256-expert prefill-ubatch
density and regressed S_PP ~2% on Qwen3.6-35B-A3B NVFP4; 8 sits between decode and
prefill density for n_experts in [128,511] at n_ubatch=512.

Honest result on the mission's MoE target (Qwen3.6-35B-A3B NVFP4, 256 experts +
GDN/SSM linear attention, GB10 sm_121, median of 5 reps): NEUTRAL. Decode S_TG is
within run-to-run noise (npl128 +0.36%) and prefill S_PP neutral (within +/-0.7%).
This model is bound by the SSM recurrence and 256-tiny-expert weight bandwidth, not
the MoE col-tile occupancy, so the col-tile lever has nothing to bite on; a npl128
tile sweep confirms 64 is the only useful width (TILE8 -6.3% ... TILE96 -0.8%). The
lever's real win lives on col-tile-bound MoE (Qwen3-Coder-30B, +4.8% @npl128 per
patch 0014), which the auto-select reproduces at npl128 by construction at zero
prefill cost. Shipped default-on because it is prefill-safe, decode-neutral here,
and correctness-gated.

LLAMA_MOE_MMQ_X (0014) kept as a manual override; LLAMA_MOE_AUTO_TILE=0 restores
exact stock selection. P0 gate: test-backend-ops test_mul_mat_id ragged small-M
NVFP4/MXFP4 MoE decode-density shapes pass CUDA-vs-CPU on GB10 both default-on and
stock. Full rationale and tables in patches/paged/MOE_DENSITY_AUTO_TILE.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-23 19:04:55 +00:00

paged

feat(paged): target-readiness for 2xH200 - correctness PASS, load-gen harness, projection

2026-06-21 23:16:28 +00:00

patches

feat(paged): mirror MoE token-tile density-aware auto-select (patch 0015)

2026-06-23 19:04:55 +00:00

CMakeLists.txt

fix(turboquant): resolve common.h by detecting llama-common vs common target (#9413 )

2026-04-18 20:30:28 +02:00

grpc-server.cpp

feat(llama-cpp): per-model max_prefill_tokens option (chunked-prefill QoS budget)