LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-25 09:09:07 -04:00

Files

Ettore Di Giacinto ea634ee958 docs(paged): scope track B - FP4-MMA decode-GEMM roofline + parity go/no-go

Roofline at the decode batch shape (M=128, NVFP4 weights) on GB10 (sm_121):
the dense weight-read floor (~1,940 tok/s) and MoE floor (~1,590 tok/s) sit
4-6x above vLLM's 391/811, so 273 GB/s is NOT the wall. At FP4 peak the GEMM is
bandwidth-bound (crossover M*~611 >> 128); at the kernel's ~3% achieved FP4
efficiency it is compute-bound by its own inefficiency (471 ms vs a 66 ms floor).

Verdict: dense decode parity is plausibly reachable via a tuned FP4-MMA decode
M-tile (track B) + fused act-quant (track A), landing 376-394 tok/s = 90-103% of
vLLM 391, but only at the top of the demonstrated GB10 FP4 envelope (~17-21%) and
with no margin (occupancy wall is the binding constraint, not bandwidth). MoE
parity is NOT reachable from the GEMM alone (ceiling ~60-76% of 811): its floor
is the hardest grouped-GEMM regime and ~24% of its step is non-GEMM work outside
track B. GO (conditional) for dense, PARTIAL for MoE. Build-ready phased plan
included; tune the existing block_fp4_mmq path, not a W4A16 rewrite.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-24 14:09:41 +00:00

paged

feat(paged): target-readiness for 2xH200 - correctness PASS, load-gen harness, projection

2026-06-21 23:16:28 +00:00

patches

docs(paged): scope track B - FP4-MMA decode-GEMM roofline + parity go/no-go

2026-06-24 14:09:41 +00:00

CMakeLists.txt

fix(turboquant): resolve common.h by detecting llama-common vs common target (#9413 )

2026-04-18 20:30:28 +02:00

grpc-server.cpp

feat(llama-cpp/paged): dynamic decode-first prefill budget (patch 0016, continuous-batch P1)