From 62f0ae17e34efd5643c77ac50eda86c6d92bb1ff Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Fri, 19 Jun 2026 22:44:11 +0000 Subject: [PATCH] docs(paged): upstream survey - no FP4 MoE GEMM to patch in; phase 3 is from-scratch No tcgen05/CUTLASS grouped-GEMM MoE kernel exists upstream (merged/in-flight/ draft); CUTLASS not a dep; no fork has one; activation-quant gather already fused. Matching vLLM needs a from-scratch tcgen05 grouped GEMM (months, maintainers deferring to cuTile). No tractable patch closes the 27x. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto --- backend/cpp/llama-cpp/paged/DGX_BLACKWELL_PLAN.md | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/backend/cpp/llama-cpp/paged/DGX_BLACKWELL_PLAN.md b/backend/cpp/llama-cpp/paged/DGX_BLACKWELL_PLAN.md index 727d0bab8..8a844b96d 100644 --- a/backend/cpp/llama-cpp/paged/DGX_BLACKWELL_PLAN.md +++ b/backend/cpp/llama-cpp/paged/DGX_BLACKWELL_PLAN.md @@ -212,6 +212,16 @@ Phases (each: hypothesis → edit `ggml/src/ggml-cuda/` → `cmake --build build it is not freely tunable. Confirms parity needs the kernel rewrite (phase 3), not knobs. 2. **Fuse activation quant** (`quantize_mmq_mxfp4`, 8%) into the permute/gather. Removes a kernel + a global round-trip. Tractable, ~1.1x. + - **Result:** NOT AVAILABLE as a cheap patch. `quantize_mmq_fp4_cuda` (mmq.cu:200) *already* takes + `ids_src1` — the gather is already fused into the quant. The only remaining fusion is quantize-on-load + *inside* the GEMM hot loop (intricate, ~8% ceiling, risky). ORippler's #24481 fuses the decode (MMVQ) + post-scale and intends a "BS>1" (prefill) follow-up — unwritten. Marginal; skip. + +**Upstream survey (2026-06):** there is NO tcgen05/CUTLASS grouped-GEMM MoE kernel in ggml — not merged, +not in-flight, not a draft (Discussion #18369 is talk, no PR; #18250 closed not-planned). CUTLASS is not a +dependency (the profile's `cutlass_80_tensorop` is cuBLAS-internal). No fork has a portable MoE kernel +(croll83/llama.cpp-dgx is GatedDeltaNet-focused). Maintainer signal (woachk on #17906): "the path forward +is to wait for cuTile C++." So **nothing to cherry-pick; phase 3 is genuinely from-scratch.** 3. **The real lever — tcgen05 / CUTLASS FP4 grouped GEMM.** Replace the per-expert MMQ scheduler with a CUTLASS 3.x collective-mainloop grouped GEMM (sm_120a, `e2m1` block-scaled, tcgen05 tensor-memory MMA), one problem over all experts with per-group offsets, fused act-quant. This is what vLLM/FlashInfer use.