LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 16:19:07 -04:00

Files

Ettore Di Giacinto 8925c009b7 docs(paged): scope durable grouped FP4-MMA MoE GEMM port for GB10

Build-ready plan (not implemented) for matching/beating vLLM MoE
grouped-GEMM efficiency on GB10 sm_121 for Qwen3-30B-A3B mxfp4.

Honest reframe: the grouped GEMM the mission scoped to build already
exists upstream and runs on GB10 for mxfp4: should_use_mmq() routes
MUL_MAT_ID to the grouped mmq path, which already contains both vLLM
building blocks (mm_ids_helper moe_align/scatter + a persistent stream-k
FP4-MMA grouped GEMM). The npl128 cliff was a since-fixed regression, not
a batched-bench artifact; re-measured decode is monotonic 85->1771 t/s.

The one structural gap is M-tile sizing: ggml maximizes mmq_x over the
aggregate token count while vLLM uses a small per-expert BLOCK_SIZE_M, so
each tiny per-expert M-tile is 3-6% filled at decode density. Scope is a
surgical two-step delta (expert-aware mmq_x selection; block-padded
moe_align), the parity gate (test_mul_mat_id bit-exact + ragged small-M),
and a phased plan gated behind the GB10 W4A16 occupancy wall.
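
The tile-fill figure above can be reproduced with a rough back-of-envelope
sketch. It is not taken from the ggml or vLLM sources. It assumes uniform
routing, 128 experts with 8 active per token (the published Qwen3-30B-A3B
configuration), an aggregate-sized M-tile of 128 rows, and a hypothetical
per-expert tile of 16 rows standing in for a small BLOCK_SIZE_M. The function
name and the decode batch sizes of 64 and 128 tokens are illustrative only.

```python
import math


def tile_fill(n_tokens: int, n_experts: int, top_k: int, tile_m: int) -> float:
    """Average fraction of occupied rows in one expert's M-tiles.

    Assumes tokens are routed uniformly, so every expert receives
    n_tokens * top_k / n_experts rows (an idealisation, not measured data).
    """
    rows_per_expert = n_tokens * top_k / n_experts
    # Each expert needs at least one tile, however few rows it holds.
    tiles = max(1, math.ceil(rows_per_expert / tile_m))
    return rows_per_expert / (tiles * tile_m)


# Tile sized from the aggregate token count (assumed mmq_x = 128).
print(tile_fill(64, 128, 8, 128), tile_fill(128, 128, 8, 128))  # 0.03125 0.0625

# Hypothetical small per-expert tile (assumed 16 rows).
print(tile_fill(64, 128, 8, 16), tile_fill(128, 128, 8, 16))  # 0.25 0.5
```

Under these assumptions a 128-row tile is about 3% to 6% full at 64 to 128
decode tokens, while a 16-row tile is 25% to 50% full, which is the gap the
expert-aware mmq_x selection is meant to close.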

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-23 13:17:03 +00:00

ds4

chore: ⬆️ Update antirez/ds4 to 80ebbc396aee40eedc1d829222f3362d10fa4c6c (#10378)

2026-06-18 00:32:13 +02:00

grpc

fix: speedup git submodule update with --single-branch (#2847)

2024-07-13 22:32:25 +02:00

ik-llama-cpp

chore: ⬆️ Update ikawrakow/ik_llama.cpp to b3dfb7858cfcb9166e92f366e5af87f19ebc94be (#10395)

2026-06-19 00:03:37 +02:00

llama-cpp

docs(paged): scope durable grouped FP4-MMA MoE GEMM port for GB10