Researched: W4A4 hangs on GB10 because FlashInfer ships no FP4 cubins for
sm_120/121 (all datacenter Sm100a); dense mm_fp4 is gated-off/returns-zeros on
consumer Blackwell, and the FlashInfer FP4 autotuner spins on the first forward
pass. Not a misconfig - dense W4A4 inference isn't validated on sm_121. W4A16
(4-bit weight / 16-bit act, Marlin) vs llama Q4_K_M is the correct apples-to-
apples (same quant class) AND the fast path. Removed the misleading 'W4A4 would
be faster / lower bound' framing. Sources: vllm #30163/#26381, flashinfer
#2577/#3294, cutlass #3096.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Confirms parity (dense+MoE, both phases) is strictly the FP4 tensor-core kernel;
no config/flag shortcut remains.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
vLLM W4A16 vs llama Q4_K_M dense: prefill 7.6-32x behind (llama plateaus ~765,
vLLM scales to 24.4k); decode ~parity at B=1 (weight-bandwidth-bound), 2.2x at
B=64. Full NVFP4 (W4A4) hangs on this vLLM/GB10 stack - W4A16 used. Decision:
the Lever-3 kernel track must ALSO deliver a non-grouped FP4 dense GEMM, not just
the MoE grouped GEMM (dense GEMM is the simpler first kernel to land).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Decode-dominated B=64 nsys: mul_mat_q<MXFP4> 54.6%, attention only 19.8%. Both
phases are FP4-MoE-kernel-bound (Lever 3). The paged series cannot close the vLLM
gap in either phase; its real value is capacity + prefix-sharing, not tok/s parity.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Prefill 6-48x behind and does NOT scale with B (kernel-bound, paging can't fix).
Decode: we win at B=1; 2.5-3.7x behind at B>=8 - THAT concurrency gap is the
engine's domain (0004 pool + 0005 continuous batching target it). Baseline for
the series to improve on.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>