Mirror of https://github.com/mudler/LocalAI.git (synced 2026-06-30 03:17:01 -04:00)
Design and plan for the #1 prefill lever: the NVFP4 weight GEMM at large M. There, MMQ (tuned for decode at M<=128, 1 CTA/SM, 128-column tile cap) is ~3.4x slower than vLLM's marlin/cutlass large-M path, which accounts for ~51% of the prefill gap.

Recommends (a): dequantize to bf16 and run cuBLAS, routed by an M-threshold (dense first, MoE grouped-cuBLAS second). Rejects (b): a from-scratch Marlin/FP4 kernel, as a multi-week project.

Key enabling finding: NVFP4->bf16 dequant kernels already exist, and NVFP4 is currently force-excluded from the tensor-core cuBLAS path (it falls back to f32 Sgemm). Relaxing that one guard is the pivot.

Caveat: bf16 cuBLAS banks ~60-75% of the GEMM gap, not full 68us/tok parity, because bf16 tensor-core peak is ~half that of FP4.

Design only: no kernel, no GPU run.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
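
For illustration only, the M-threshold routing recommended in (a) could look roughly like the sketch below. Every name in it (`GemmPath`, `select_nvfp4_gemm_path`, `kLargeMThreshold`) is hypothetical and not taken from the LocalAI or ggml sources, and the cutoff of 128 is an assumption borrowed from the M<=128 range MMQ is described as tuned for; the real threshold would have to be measured.

```cpp
#include <cstdint>

// Hypothetical names throughout: this is NOT the LocalAI / ggml API.
// The two NVFP4 weight-GEMM paths discussed in the commit message.
enum class GemmPath {
    MMQ,               // existing quantized kernel, tuned for decode / small M
    DequantBf16Cublas  // dequantize NVFP4 -> bf16, then tensor-core cuBLAS
};

// ASSUMPTION: cutoff equal to the M <= 128 range MMQ is tuned for.
// A real implementation would pick this value from benchmarks.
constexpr std::int64_t kLargeMThreshold = 128;

// Choose a path for an NVFP4 weight GEMM with m activation rows.
// Small m (decode) stays on MMQ; large m (prefill) takes the bf16 route.
constexpr GemmPath select_nvfp4_gemm_path(
        std::int64_t m, std::int64_t threshold = kLargeMThreshold) {
    return (m > threshold) ? GemmPath::DequantBf16Cublas : GemmPath::MMQ;
}
```

Under these assumptions a single-token decode step (`m == 1`) stays on MMQ, while a 2048-token prefill batch is routed to the dequant->bf16 cuBLAS path. Taking the threshold as a parameter keeps the cutoff tunable per GPU without touching the dispatch logic.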