From ce60737fc562b6c39af772703fbb4f45a36d8fd7 Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Sat, 20 Jun 2026 03:56:33 +0000 Subject: [PATCH] kernel(doc): dense scope resolved - two FP4 kernels (dense first, then grouped) Benchmark confirms dense prefill 7.6-32x behind too, so the kernel track needs a non-grouped FP4 dense GEMM (simpler, land first) + the MoE grouped GEMM. Both share the e2m1 block-scaled collective; dense is grouped-with-one-group. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto --- .../llama-cpp/paged/FP4_GROUPED_MOE_KERNEL.md | 20 ++++++++++++------- 1 file changed, 13 insertions(+), 7 deletions(-) diff --git a/backend/cpp/llama-cpp/paged/FP4_GROUPED_MOE_KERNEL.md b/backend/cpp/llama-cpp/paged/FP4_GROUPED_MOE_KERNEL.md index 80e198e08..54123c413 100644 --- a/backend/cpp/llama-cpp/paged/FP4_GROUPED_MOE_KERNEL.md +++ b/backend/cpp/llama-cpp/paged/FP4_GROUPED_MOE_KERNEL.md @@ -42,11 +42,17 @@ This is the integration seam. The kernel fills the stub. - Risk: needs ncu-driven iteration on the GB10; this is multi-week, expert-CUDA. No upstream base to fork (exhaustive search confirmed). Net-new value upstream. -## DENSE follow-up (TODO #28 — important, do before committing to MoE-only) +## DENSE scope — RESOLVED (TODO #28, benchmarked): dense needs an FP4 GEMM too -This kernel is **grouped** (MoE). **Dense** models (e.g. Qwen3 ~27B) use the non-grouped FP4 GEMM path — a -different kernel. Before assuming the kernel work is MoE-only, benchmark **Qwen3-27B dense: vLLM NVFP4 vs -llama.cpp Q4_K_M** (prefill+decode, GB10). If dense shows the same large gap → the kernel track must also -deliver a non-grouped block-scaled FP4 GEMM (a CUTLASS dense GEMM, simpler than grouped). If dense is already -competitive (single-stream dense was only ~10% of MoE-model time) → MoE-grouped is the priority and dense can -ride the existing MMQ/cuBLAS path. This decides the kernel scope. +Benchmarked Qwen3-32B dense, vLLM W4A16 vs llama.cpp Q4_K_M (`BENCHMARKS.md`). **Dense prefill is 7.6–32× +behind** (llama int8-MMQ plateaus ~765 t/s; vLLM FP4 scales to 24.4k); decode ~parity at B=1, 2.2× at B=64. +So the kernel track is **two kernels, not one**: + +- **(a) Dense FP4 GEMM** — a plain non-grouped CUTLASS/tcgen05 block-scaled FP4 GEMM. **Simpler than grouped; + land this FIRST** — it's the easier first kernel, benefits every dense model, and de-risks the FP4 collective + before the grouped variant. Hook: the non-MoE `ggml_cuda_mul_mat_q` (no `ids`) path. +- **(b) MoE grouped FP4 GEMM** — the scaffold above (`ggml_cuda_fp4_grouped_moe`), per-expert offsets. + +Both share the same block-scaled `e2m1` collective; (a) is (b) with one group. Suggested order: build (a), +prove the FP4 collective + parity harness, then generalize to (b). (Aside: full NVFP4/W4A4 currently *hangs* +on vLLM 0.23.0 / GB10 for dense — only W4A16 ran — so the measured gap is a lower bound.)