Design + plan for the #1 prefill lever: NVFP4 weight GEMM at large M, where
MMQ (decode/M<=128-tuned, 1 CTA/SM, 128-col tile cap) is ~3.4x slower than
vLLM's marlin/cutlass large-M path (~51% of the prefill gap).
Recommends (a) dequant->bf16 cuBLAS routed by an M-threshold (dense first,
MoE grouped-cuBLAS second); rejects (b) a from-scratch Marlin/FP4 kernel as a
multi-week project. Key enabling finding: NVFP4->bf16 dequant kernels already
exist, and NVFP4 is currently force-excluded from the tensor-core cuBLAS path
(falls to f32 Sgemm) - relaxing that one guard is the pivot. Honest: bf16-cuBLAS
banks ~60-75% of the GEMM gap, not full 68us/tok parity (bf16 TC peak ~half FP4).
Design only - no kernel, no GPU run.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]