Files
LocalAI/backend/cpp/llama-cpp/paged/FP4_GROUPED_MOE_KERNEL.md
Ettore Di Giacinto d2651c86d9 bench(dense): root-cause the W4A4 NVFP4 hang; W4A16 vs Q4 is the headline
Researched: W4A4 hangs on GB10 because FlashInfer ships no FP4 cubins for
sm_120/121 (all datacenter Sm100a); dense mm_fp4 is gated-off/returns-zeros on
consumer Blackwell, and the FlashInfer FP4 autotuner spins on the first forward
pass. Not a misconfig - dense W4A4 inference isn't validated on sm_121. W4A16
(4-bit weight / 16-bit act, Marlin) vs llama Q4_K_M is the correct apples-to-
apples (same quant class) AND the fast path. Removed the misleading 'W4A4 would
be faster / lower bound' framing. Sources: vllm #30163/#26381, flashinfer
#2577/#3294, cutlass #3096.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-20 06:59:50 +00:00

4.2 KiB
Raw Blame History

FP4 grouped-GEMM MoE kernel (Lever 3) — scaffold + implementation plan

The one piece of work that actually closes the vLLM gap on Blackwell (GB10/sm_121). Both phases are bottlenecked by the same kernel: mul_mat_q<MXFP4> (warp-level mma.sync grouped MMQ, ~22 TFLOP/s) is 37% of prefill and 54.6% of decode-at-B=64 GPU time (BENCHMARKS.md). Paged attention can't touch it (proven). The fix is a CUTLASS-3.x collective-mainloop grouped GEMM with block-scaled e2m1 operands via tcgen05 tensor-memory MMA — what vLLM/FlashInfer/TRT-LLM use.

Scaffold (DONE — builds clean, default byte-identical)

Lives in the DGX checkout ~/llama.cpp-pr24423/ggml/src/ggml-cuda/ (to be rebased onto the pin as a patch / upstreamed). Captured diff: patches/kernel/0001-fp4-grouped-moe-scaffold.patch.

  • fp4-grouped-moe.{cuh,cu} — entry ggml_cuda_fp4_grouped_moe(ctx, src0, src1, ids, dst) -> bool (true = handled, false = fall back to MMQ). Gated behind env GGML_CUDA_FP4_GROUPED. Currently always returns false → default build unchanged.
  • Hook in ggml_cuda_mul_mat_id (the MoE dispatch), before the ggml_cuda_mul_mat_q(...ids...) call: if (ggml_cuda_fp4_grouped_moe(...)) return;. Builds via the file(GLOB "*.cu") (re-run cmake configure after adding the file — GLOB is configure-time).

This is the integration seam. The kernel fills the stub.

Implementation phases (each: build on GB10 → numerical parity vs mul_mat_q<MXFP4> → bench)

  1. Reference grouped GEMM (correctness first, slow OK). Per-expert problem sizes + offsets from ids; dequant e2m1+scales → BF16; loop CUTLASS (or cuBLAS) per group. Gate: output matches MMQ within fp tol on a 2-expert toy + the real model (token-identical greedy). Establishes the harness + the data plumbing.
  2. CUTLASS GemmGrouped, sm_120a, BF16 operands. Replace the loop with one cutlass::gemm::device:: GemmGrouped launch over all experts (per-group offsets). Measures the grouping win alone.
  3. Block-scaled FP4 operands (the real lever). e2m1 A/B with e8m0(MX)/e4m3(NV) block scales via the Blackwell scaled-MMA collective (tcgen05 tensor-memory). This is where the TFLOP/s jumps. Needs CUTLASS 3.x + sm_120a; verify the block-scale layout matches ggml's MXFP4/NVFP4 packing.
  4. Fuse activation quant (the F32→FP4 of src1) into the gather/permute prologue.
  5. Enable by default on sm_120/121 when parity holds + faster; keep the env as an escape hatch.

Dependencies / decisions

  • CUTLASS is not currently a ggml dependency (the profile's cutlass_80_tensorop is cuBLAS-internal). Adding it = submodule/fetch + include dir, gated to CUDA sm_120+. Float the approach with ggml maintainers early (Discussion #18369 is the home; JohannesGaessler asked to discuss arch before big kernel work).
  • Target sm_120a/121a (consumer Blackwell). Datacenter Blackwell (sm_100) is a separate tile config.
  • Risk: needs ncu-driven iteration on the GB10; this is multi-week, expert-CUDA. No upstream base to fork (exhaustive search confirmed). Net-new value upstream.

DENSE scope — RESOLVED (TODO #28, benchmarked): dense needs an FP4 GEMM too

Benchmarked Qwen3-32B dense, vLLM W4A16 vs llama.cpp Q4_K_M (BENCHMARKS.md). Dense prefill is 7.632× behind (llama int8-MMQ plateaus ~765 t/s; vLLM FP4 scales to 24.4k); decode ~parity at B=1, 2.2× at B=64. So the kernel track is two kernels, not one:

  • (a) Dense FP4 GEMM — a plain non-grouped CUTLASS/tcgen05 block-scaled FP4 GEMM. Simpler than grouped; land this FIRST — it's the easier first kernel, benefits every dense model, and de-risks the FP4 collective before the grouped variant. Hook: the non-MoE ggml_cuda_mul_mat_q (no ids) path.
  • (b) MoE grouped FP4 GEMM — the scaffold above (ggml_cuda_fp4_grouped_moe), per-expert offsets.

Both share the same block-scaled e2m1 collective; (a) is (b) with one group. Suggested order: build (a), prove the FP4 collective + parity harness, then generalize to (b). (Aside: full W4A4 NVFP4 doesn't run on GB10 today — FlashInfer ships no FP4 cubins for sm_121, so the dense mm_fp4 kernel hangs/returns zeros; the W4A16 Marlin path is the fast, correct one and is the fair comparison. See BENCHMARKS.md for the root cause.)