Researched: W4A4 hangs on GB10 because FlashInfer ships no FP4 cubins for sm_120/121 (all datacenter Sm100a); dense mm_fp4 is gated-off/returns-zeros on consumer Blackwell, and the FlashInfer FP4 autotuner spins on the first forward pass. Not a misconfig - dense W4A4 inference isn't validated on sm_121. W4A16 (4-bit weight / 16-bit act, Marlin) vs llama Q4_K_M is the correct apples-to- apples (same quant class) AND the fast path. Removed the misleading 'W4A4 would be faster / lower bound' framing. Sources: vllm #30163/#26381, flashinfer #2577/#3294, cutlass #3096. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
4.2 KiB
FP4 grouped-GEMM MoE kernel (Lever 3) — scaffold + implementation plan
The one piece of work that actually closes the vLLM gap on Blackwell (GB10/sm_121). Both phases are
bottlenecked by the same kernel: mul_mat_q<MXFP4> (warp-level mma.sync grouped MMQ, ~22 TFLOP/s) is
37% of prefill and 54.6% of decode-at-B=64 GPU time (BENCHMARKS.md). Paged attention can't touch
it (proven). The fix is a CUTLASS-3.x collective-mainloop grouped GEMM with block-scaled e2m1 operands via
tcgen05 tensor-memory MMA — what vLLM/FlashInfer/TRT-LLM use.
Scaffold (DONE — builds clean, default byte-identical)
Lives in the DGX checkout ~/llama.cpp-pr24423/ggml/src/ggml-cuda/ (to be rebased onto the pin as a patch /
upstreamed). Captured diff: patches/kernel/0001-fp4-grouped-moe-scaffold.patch.
fp4-grouped-moe.{cuh,cu}— entryggml_cuda_fp4_grouped_moe(ctx, src0, src1, ids, dst) -> bool(true = handled, false = fall back to MMQ). Gated behind envGGML_CUDA_FP4_GROUPED. Currently always returns false → default build unchanged.- Hook in
ggml_cuda_mul_mat_id(the MoE dispatch), before theggml_cuda_mul_mat_q(...ids...)call:if (ggml_cuda_fp4_grouped_moe(...)) return;. Builds via thefile(GLOB "*.cu")(re-run cmake configure after adding the file — GLOB is configure-time).
This is the integration seam. The kernel fills the stub.
Implementation phases (each: build on GB10 → numerical parity vs mul_mat_q<MXFP4> → bench)
- Reference grouped GEMM (correctness first, slow OK). Per-expert problem sizes + offsets from
ids; dequante2m1+scales → BF16; loop CUTLASS (or cuBLAS) per group. Gate: output matches MMQ within fp tol on a 2-expert toy + the real model (token-identical greedy). Establishes the harness + the data plumbing. - CUTLASS GemmGrouped, sm_120a, BF16 operands. Replace the loop with one
cutlass::gemm::device:: GemmGroupedlaunch over all experts (per-group offsets). Measures the grouping win alone. - Block-scaled FP4 operands (the real lever).
e2m1A/B withe8m0(MX)/e4m3(NV) block scales via the Blackwell scaled-MMA collective (tcgen05 tensor-memory). This is where the TFLOP/s jumps. Needs CUTLASS 3.x + sm_120a; verify the block-scale layout matches ggml's MXFP4/NVFP4 packing. - Fuse activation quant (the F32→FP4 of src1) into the gather/permute prologue.
- Enable by default on sm_120/121 when parity holds + faster; keep the env as an escape hatch.
Dependencies / decisions
- CUTLASS is not currently a ggml dependency (the profile's
cutlass_80_tensoropis cuBLAS-internal). Adding it = submodule/fetch + include dir, gated to CUDA sm_120+. Float the approach with ggml maintainers early (Discussion #18369 is the home; JohannesGaessler asked to discuss arch before big kernel work). - Target sm_120a/121a (consumer Blackwell). Datacenter Blackwell (sm_100) is a separate tile config.
- Risk: needs ncu-driven iteration on the GB10; this is multi-week, expert-CUDA. No upstream base to fork (exhaustive search confirmed). Net-new value upstream.
DENSE scope — RESOLVED (TODO #28, benchmarked): dense needs an FP4 GEMM too
Benchmarked Qwen3-32B dense, vLLM W4A16 vs llama.cpp Q4_K_M (BENCHMARKS.md). Dense prefill is 7.6–32×
behind (llama int8-MMQ plateaus ~765 t/s; vLLM FP4 scales to 24.4k); decode ~parity at B=1, 2.2× at B=64.
So the kernel track is two kernels, not one:
- (a) Dense FP4 GEMM — a plain non-grouped CUTLASS/tcgen05 block-scaled FP4 GEMM. Simpler than grouped;
land this FIRST — it's the easier first kernel, benefits every dense model, and de-risks the FP4 collective
before the grouped variant. Hook: the non-MoE
ggml_cuda_mul_mat_q(noids) path. - (b) MoE grouped FP4 GEMM — the scaffold above (
ggml_cuda_fp4_grouped_moe), per-expert offsets.
Both share the same block-scaled e2m1 collective; (a) is (b) with one group. Suggested order: build (a),
prove the FP4 collective + parity harness, then generalize to (b). (Aside: full W4A4 NVFP4 doesn't run on
GB10 today — FlashInfer ships no FP4 cubins for sm_121, so the dense mm_fp4 kernel hangs/returns zeros; the
W4A16 Marlin path is the fast, correct one and is the fair comparison. See BENCHMARKS.md for the root cause.)