From 62f0ae17e34efd5643c77ac50eda86c6d92bb1ff Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Fri, 19 Jun 2026 22:44:11 +0000
Subject: [PATCH] docs(paged): upstream survey - no FP4 MoE GEMM to patch in;
 phase 3 is from-scratch

No tcgen05/CUTLASS grouped-GEMM MoE kernel exists upstream (merged/in-flight/
draft); CUTLASS not a dep; no fork has one; activation-quant gather already
fused. Matching vLLM needs a from-scratch tcgen05 grouped GEMM (months,
maintainers deferring to cuTile). No tractable patch closes the 27x.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
 backend/cpp/llama-cpp/paged/DGX_BLACKWELL_PLAN.md | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/backend/cpp/llama-cpp/paged/DGX_BLACKWELL_PLAN.md b/backend/cpp/llama-cpp/paged/DGX_BLACKWELL_PLAN.md
index 727d0bab8..8a844b96d 100644
--- a/backend/cpp/llama-cpp/paged/DGX_BLACKWELL_PLAN.md
+++ b/backend/cpp/llama-cpp/paged/DGX_BLACKWELL_PLAN.md
@@ -212,6 +212,16 @@ Phases (each: hypothesis → edit `ggml/src/ggml-cuda/` → `cmake --build build
      it is not freely tunable. Confirms parity needs the kernel rewrite (phase 3), not knobs.
 2. **Fuse activation quant** (`quantize_mmq_mxfp4`, 8%) into the permute/gather. Removes a kernel +
    a global round-trip. Tractable, ~1.1x.
+   - **Result:** NOT AVAILABLE as a cheap patch. `quantize_mmq_fp4_cuda` (mmq.cu:200) *already* takes
+     `ids_src1` — the gather is already fused into the quant. The only remaining fusion is quantize-on-load
+     *inside* the GEMM hot loop (intricate, ~8% ceiling, risky). ORippler's #24481 fuses the decode (MMVQ)
+     post-scale and intends a "BS>1" (prefill) follow-up — unwritten. Marginal; skip.
+
+**Upstream survey (2026-06):** there is NO tcgen05/CUTLASS grouped-GEMM MoE kernel in ggml — not merged,
+not in-flight, not a draft (Discussion #18369 is talk, no PR; #18250 closed not-planned). CUTLASS is not a
+dependency (the profile's `cutlass_80_tensorop` is cuBLAS-internal). No fork has a portable MoE kernel
+(croll83/llama.cpp-dgx is GatedDeltaNet-focused). Maintainer signal (woachk on #17906): "the path forward
+is to wait for cuTile C++." So **nothing to cherry-pick; phase 3 is genuinely from-scratch.**
 3. **The real lever — tcgen05 / CUTLASS FP4 grouped GEMM.** Replace the per-expert MMQ scheduler with a
    CUTLASS 3.x collective-mainloop grouped GEMM (sm_120a, `e2m1` block-scaled, tcgen05 tensor-memory MMA),
    one problem over all experts with per-group offsets, fused act-quant. This is what vLLM/FlashInfer use.