docs(paged): Lever-3 phase-1 nwarps tweak = dead end (constants coupled)

static_assert(nwarps*tile_C::I == mmq_y) locks nwarps=8 for mmq_y=128; can't raise occupancy without co-scaling mmq_y (blows Blackwell smem). MMQ kernel is not freely tunable -> parity needs the tcgen05/CUTLASS rewrite, not knobs. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-26 01:16:58 -04:00 · 2026-06-19 22:32:02 +00:00
parent 1449b806ab
commit b14214620c
1 changed files with 4 additions and 0 deletions
--- a/backend/cpp/llama-cpp/paged/DGX_BLACKWELL_PLAN.md
+++ b/backend/cpp/llama-cpp/paged/DGX_BLACKWELL_PLAN.md
@@ -206,6 +206,10 @@ Phases (each: hypothesis → edit `ggml/src/ggml-cuda/` → `cmake --build build
 `llama-bench` MXFP4 pp/concurrency → record):
 1. **Cheap kernel tweaks (low confidence, fast).** nwarps (occupancy), `mmq_y` tile, stream-k on/off,
   FP4 load-tile path. Measure each. Likely small (<1.3x) — these don't change the warp-MMA ceiling.
+   - **Result (nwarps):** DEAD END. `nwarps` is locked by `static_assert(nwarps*tile_C::I == mmq_y)`
+     (mmq.cuh:3234) → nwarps=8 for mmq_y=128. Can't raise occupancy without co-scaling mmq_y to 256
+     (nwarps=16), which blows Blackwell shared-memory limits. The MMQ constants are tightly coupled;
+     it is not freely tunable. Confirms parity needs the kernel rewrite (phase 3), not knobs.
 2. **Fuse activation quant** (`quantize_mmq_mxfp4`, 8%) into the permute/gather. Removes a kernel +
   a global round-trip. Tractable, ~1.1x.
 3. **The real lever — tcgen05 / CUTLASS FP4 grouped GEMM.** Replace the per-expert MMQ scheduler with a