diff --git a/backend/cpp/llama-cpp/paged/DGX_BLACKWELL_PLAN.md b/backend/cpp/llama-cpp/paged/DGX_BLACKWELL_PLAN.md index dff4728a1..727d0bab8 100644 --- a/backend/cpp/llama-cpp/paged/DGX_BLACKWELL_PLAN.md +++ b/backend/cpp/llama-cpp/paged/DGX_BLACKWELL_PLAN.md @@ -206,6 +206,10 @@ Phases (each: hypothesis → edit `ggml/src/ggml-cuda/` → `cmake --build build `llama-bench` MXFP4 pp/concurrency → record): 1. **Cheap kernel tweaks (low confidence, fast).** nwarps (occupancy), `mmq_y` tile, stream-k on/off, FP4 load-tile path. Measure each. Likely small (<1.3x) — these don't change the warp-MMA ceiling. + - **Result (nwarps):** DEAD END. `nwarps` is locked by `static_assert(nwarps*tile_C::I == mmq_y)` + (mmq.cuh:3234) → nwarps=8 for mmq_y=128. Can't raise occupancy without co-scaling mmq_y to 256 + (nwarps=16), which blows Blackwell shared-memory limits. The MMQ constants are tightly coupled; + it is not freely tunable. Confirms parity needs the kernel rewrite (phase 3), not knobs. 2. **Fuse activation quant** (`quantize_mmq_mxfp4`, 8%) into the permute/gather. Removes a kernel + a global round-trip. Tractable, ~1.1x. 3. **The real lever — tcgen05 / CUTLASS FP4 grouped GEMM.** Replace the per-expert MMQ scheduler with a