mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-23 16:19:07 -04:00
Captures the full dgx.casa investigation: Q8/F16/vLLM baselines,
concurrency sweeps, paged-patch (no concurrency effect), nsys+code
root-cause (MoE int8 MMQ on Ampere-class tensor cores = 74.5% compute,
no FP8 path), and the lever plan.

Measured wins:
- Lever 1 (MXFP4 / Blackwell FP4 path): decode +50-66% over Q8, prefill
  plateau +66% (2200->3650). MXFP4 decode beats vLLM FP8 at B=1
  (83 vs 48), near-parity B=8. Prefill still plateaus
  (fused-MoE-GEMM gap).
- Lever 2 (ubatch): saturates at 2048; ceiling is the kernel, not batch.

Designed (not built): Lever 3 fused FP4/FP8 MoE grouped GEMM, Lever 4
FP8 GEMM (needs ggml_mul_mat_ext scale plumbing), Lever 5 tcgen05
kernels, and the complete paged attention (on-demand alloc +
gather-read + continuous batching + prefix sharing).

Honest scope: each is multi-week kernel/systems work.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>