docs(paged): B-3 mmq_y-down warp-remap NEGATIVE - bit-exact MoE ceiling ~85% of vLLM

B-3 (the 0017-deferred mmq_y-down warp-remap of the NVFP4 grouped FP4-MMA mul_mat_q) was built bit-exact on the clean 0025 base and measured: the grouped GEMM kernel itself runs -1.3% (occupancy did rise via the nwarps=4 warp-remap / 128 threads-per-CTA), but end-to-end MoE decode is FLAT (npl128 +0.4%, npl32 +0.3%, within noise) because the stream-k fixup grows +42% (mmq_y=64 doubles the row-tiles) and the step is SSM/BW-bound. md5 PASS both models, test-backend-ops MUL_MAT 1146/1146 + MUL_MAT_ID 806/806 PASS. No patch 0028; DGX dev tree reverted to pristine 0025. Assessment: the bit-exact MoE GEMM/launch track is exhausted (B-1 re-graph banked ~82->85%; B-2 and B-3 are 0). Honest bit-exact MoE ceiling = ~85% of vLLM @npl128. The residual is the structural Marlin-NvFp4 grouped-GEMM gap that no bit-exact lever closes. Recommend shipping the ~85% bit-exact default and exposing the held 0026 bf16-SSM as a default-off opt-in (it reaches ~95% but is non-bit-exact and fails the MoE KL gate). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 09:57:14 -04:00 · 2026-06-26 19:10:24 +00:00
parent 1f857f179e
commit 9c1c2a6a16
2 changed files with 185 additions and 27 deletions
--- a/backend/cpp/llama-cpp/patches/paged/B_MOE_PROGRESS.md
+++ b/backend/cpp/llama-cpp/patches/paged/B_MOE_PROGRESS.md
@@ -1,35 +1,51 @@
-# B_MOE_PROGRESS.md - B-2 (down_proj act-quant retune, patch 0027) checkpoint
+# B_MOE_PROGRESS.md - B-3 (mmq_y-down warp-remap, patch 0028) checkpoint

-Agent: B2-build (GPU agent). Base: 0025 tip (DGX `~/llama-paged-dev` `2f4f5ab`, branch `b-work`),
-independent of the held hybrid 0026. Worktree: `.../feat+paged-attention`.
+Agent: B3-or-assess (GPU agent, DGX GB10 sm_121). Base: clean 0025 tip (`~/llama-paged-dev`
+`2f4f5ab`, branch `b-work`), independent of the held hybrid 0026. Worktree: `.../feat+paged-attention`.

-## The lever (B-2 / M1)
-Bit-exact block/grid/occupancy retune of `quantize_mmq_nvfp4` (the MoE down_proj activation-quant,
-~2% of the MoE decode step). `ggml/src/ggml-cuda/quantize.cu`, `quantize_mmq_fp4_cuda` NVFP4 branch.
+## Prior: B-2 (act-quant retune) = NEGATIVE (no lift, no patch 0027). MoE ~85% of vLLM @npl128.
+B-2 proved the act-quant tax (~2%) is already optimally tiled; the structural MoE residual is the
+grouped FP4 `mul_mat_q<NVFP4>` GEMM (~27%, LPDDR5x BW floor) + bf16 projections (~10.5%). => try B-3.

-## Why it is provably byte-identical
-`quantize_mmq_nvfp4` maps thread -> column purely through the global linear index
-`gy = blockDim.x*blockIdx.y + threadIdx.x` -> `i0_base = gy*QK_NVFP4_SUB`, with NO cross-thread
-communication (no shared memory, no warp reduction) and every thread owning a disjoint output
-sub-block (its own `sub` slot in `block_fp4_mmq`). So the (thread)->output-byte map - and thus the
-produced bytes - are invariant to `blockDim.x` as long as `block_num_y` is recomputed from the SAME
-`blockDim.x`. We retune ONLY `blockDim.x`; the per-thread quant body + writeback are untouched.
+## The lever (B-3 / SPEEDUP_HUNT B rank #3)
+mmq_y-down warp-remap of the NVFP4 FP4-MMA grouped GEMM `mul_mat_q<NVFP4>` in `ggml/.../mmq.cuh`.
+mmq_y tiles the weight-row (N) dimension; lowering 128->64 raises resident CTAs (smaller per-CTA
+shared + accumulator + 128 vs 256 threads/CTA => ~2x blocks/SM) to hide LPDDR5x weight-load latency,
+WITHOUT re-reading weights (each weight row lives in exactly one row-tile => BW-neutral). The MoE
+GEMM runs at ~35% of peak BW (occupancy-limited, NOT BW-saturated), so more resident CTAs is the
+right mechanism - and it is the ONE untested occupancy lever (M-tile = NEUTRAL 0015, MINBLOCKS =
+8.7% slower 0017).

-## Change
-`static const int nvfp4_block_size` selected once via env `LLAMA_MOE_QUANT_BLOCK` (default 128 =
-baseline; final = measured GB10 winner), `block_num_y` recomputed consistently. ~20 LOC, one TU.
+## The coupling that makes it a real kernel change (not the 0017 knob alone)
+The FP4-MMA path has `static_assert(nwarps*tile_C::I == mmq_y)` (mmq.cuh:3280; tile_C::I==16 for the
+m16n8k64 block-scaled FP4 MMA). nwarps is global `256/warp_size = 8`, so mmq_y is pinned at 128. The
+0017 `GGML_CUDA_FP4_MMQ_Y` knob alone would TRIP this assert at mmq_y=64. B-3 makes nwarps TYPE-AWARE:
+`mmq_get_nwarps_device<type>()` returns mmq_y/16 = 4 for NVFP4-reduced (else stock 8), keeping the
+coupling. 2 new overloads (device template + host 3-arg) + 9 call-site swaps to `<type>`. Default
+GGML_CUDA_FP4_MMQ_Y==128 returns stock nwarps for EVERY type => default build byte-identical to stock.

-## Status: COMPLETE - NEGATIVE (no lift). Full result in B_MOE_RESULTS.md.
- [x] Branched `b-work` off 0025 (`2f4f5ab`); patch applied to quantize.cu.
- [x] Build clean (llama-completion, llama-batched-bench, test-backend-ops). BUILD_EXIT=0.
- [x] md5 gate @block=128 (default): dense 5951a5b4 == ref, MoE 07db32c2 == ref. MUL_MAT 1146/1146,
-      MUL_MAT_ID 806/806 PASS.
- [x] BIT-EXACT proof across block sizes: block 64 AND 256 -> identical md5 both models.
- [x] Sweep block {32,64,96,128,160,192,256}: end-to-end FLAT (npl32 436-438, npl128 749-752, all
-      within 0.4% noise). NO block lifts decode.
- [x] nsys quantize_mmq_nvfp4: block=128 is the FASTEST (117.4M ns; 64 +8.7%, 192 +9.9%, 256 +6.9%).
-      128 already optimal => ZERO headroom.
- [x] DECISION: no patch 0027 (does not lift). Dev tree reverted to pristine 0025. Recommend B-3.
+## Bit-exactness note (the real risk)
+The per-output K-reduction order is mmq_y-INVARIANT (each output row owned by one thread). BUT mmq_y=64
+DOUBLES nty (row-tiles), changing the stream-k kbc partition => an output tile's K-range may be split
+across CTAs at different points and recombined by `mul_mat_q_stream_k_fixup` in a different grouping =>
+FP non-associativity CAN perturb the last logit bits => greedy argmax COULD flip. So B-3 is NOT
+bit-exact-by-construction in the md5 sense; the md5 gate is EMPIRICAL. md5 fail => not bit-exact => STOP.
+
+## Status: COMPLETE - BIT-EXACT but FLAT. No patch 0028. Full result + assessment in B_MOE_RESULTS.md.
+- [x] Source-read mmq.cuh: nwarps/mmq_y coupling, FP4 MMA vec_dot, kernel+fixup+launch+case sites.
+- [x] Edited mmq.cuh: 2 nwarps overloads + 9 `<type>` swaps. git diff clean (37+/11-).
+- [x] BEFORE baseline (stock-0025 binaries, same session): dense md5 5951a5b4==ref, moe 07db32c2==ref;
+      MoE S_TG npl32=441.98, npl128=756.47.
+- [x] BUILD build-cuda @mmq_y=64 (full cuda rebuild): EXIT=0 - compiles (static_assert holds at 4*16=64).
+- [x] md5 GATE PASS both models @64; test-backend-ops MUL_MAT 1146/1146, MUL_MAT_ID 806/806 PASS.
+- [x] Clean back-to-back A/B (build-cuda-base @128 vs build-cuda @64), 3 reps: npl32 +0.29%,
+      npl128 +0.40% - within the ~0.4% noise band. FLAT.
+- [x] nsys A/B: grouped GEMM kernel mmq_y=64 -1.3% FASTER, BUT stream_k_fixup +42% costlier + SSM (40%)
+      dominant & untouched => end-to-end inert. BW-bound confirmed (same as 0015/0017/B-2).
+- [x] DECIDED: FLAT -> no patch 0028. Dev tree reverted to pristine 0025 (no ggml diff), build-cuda
+      reconfigured to default + rebuilt. Bit-exact MoE ceiling = ~85% @npl128 / ~87.5% @npl32 of vLLM.
+- [x] ASSESS + RECOMMEND (in B_MOE_RESULTS.md): residual = structural Marlin-NvFp4 grouped-GEMM gap,
+      uncloseable bit-exactly; fall back to 0026 bf16-SSM opt-in (default-off, fails MoE KL gate, ~95%).

 ## Gate references
 - dense q36-27b-nvfp4 md5 == 5951a5b4d624ce891e22ab5fca9bc439
--- a/backend/cpp/llama-cpp/patches/paged/B_MOE_RESULTS.md
+++ b/backend/cpp/llama-cpp/patches/paged/B_MOE_RESULTS.md
@@ -88,3 +88,145 @@ hook + this measurement confirm 128 is the GB10 optimum, should other hardware e
 Hand off to B-3 (patch 0028) as the next MoE GEMM lever.

 Assisted-by: Claude:opus-4.8 [Claude Code]
+
+---
+
+# B-3 (mmq_y-down warp-remap of the NVFP4 grouped GEMM) RESULT: BIT-EXACT but FLAT (no patch 0028)
+
+Agent: B3-or-assess (GPU agent, DGX GB10 sm_121). Base: clean 0025 tip (`~/llama-paged-dev` `2f4f5ab`,
+branch `b-work`), independent of the held hybrid 0026. Lever: SPEEDUP_HUNT.md section B rank #3 - the
+0017-deferred structural `mmq_y`-down warp-remap on the grouped FP4-MMA `mul_mat_q<NVFP4>` (the ~26-27%
+MoE-specific GEMM), the only untested MoE GEMM occupancy lever.
+
+## VERDICT
+**Bit-exact (md5 PASS both models + test-backend-ops PASS), but end-to-end FLAT: npl128 +0.3-0.4%
+(consistent direction, kernel-backed) and npl32 +0.1-0.3%, both inside the ~0.4% run-to-run band. The
+warp-remap makes the grouped GEMM kernel ITSELF ~1.3% faster (occupancy DID rise) but the step is
+BW/SSM-bound, so it does NOT lift MoE decode. No patch 0028.** MoE stays ~85% of vLLM @npl128.
+
+## The change that was built+measured (bit-exact, then REVERTED)
+`ggml/src/ggml-cuda/mmq.cuh`. The FP4-MMA path couples the weight-row tile to the warp count via the
+invariant `static_assert(nwarps*tile_C::I == mmq_y)` (mmq.cuh:3280; `tile_C::I==16` for the m16n8k64
+block-scaled FP4 MMA). `nwarps` is global `256/warp_size = 8`, pinning `mmq_y=128`; the 0017
+`GGML_CUDA_FP4_MMQ_Y` knob alone would TRIP that assert at 64. B-3 makes nwarps TYPE-AWARE: a new
+`mmq_get_nwarps_device<type>()` (+ 3-arg host overload) returns `mmq_y/16 = 4` for NVFP4-reduced (else
+the stock 8), so `mmq_y=64 -> nwarps=4 -> 128 threads/CTA` (vs 256) -> ~2x resident CTAs. 2 overloads +
+9 `<type>` call-site swaps (kernel, process_tile, write_back_mma, stream_k_fixup, nvfp4 loader, 2 host).
+Built with `-DGGML_CUDA_FP4_MMQ_Y=64`; the compile SUCCEEDS (the static_assert now holds at 4*16=64).
+**Default `GGML_CUDA_FP4_MMQ_Y==128` returns stock nwarps for every type => a default build is
+byte-identical to stock** (the bit-exact opt-out, proven by the md5 below at 128).
+
+### Bit-exactness is EMPIRICAL here (not by-construction)
+The per-output K-reduction order is mmq_y-invariant (each output row owned by one thread), but mmq_y=64
+DOUBLES `nty` (row-tiles), changing the stream-k `kbc` partition => an output tile's K-range can be
+split across CTAs at different points and recombined by `mul_mat_q_stream_k_fixup` in a different
+grouping => FP non-associativity COULD perturb the last logit bits and flip a greedy argmax. It did NOT
+for the gate prompt (md5 matched), but B-3 is therefore NOT bit-exact-by-construction - a default-ON
+ship would be a (small) precision risk. This is a second reason not to ship it for a 0% gain.
+
+## GATE (bit-exact) - BOTH MODELS PASS
+greedy `llama-completion -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1`:
+- dense q36-27b-nvfp4 = 5951a5b4d624ce891e22ab5fca9bc439 == ref
+- MoE   q36-35b-a3b-nvfp4 = 07db32c2bcb78d17a43ed18bc22705cd == ref
+- test-backend-ops CUDA0: **MUL_MAT 1146/1146 PASS, MUL_MAT_ID 806/806 PASS.**
+
+## MEASUREMENT 1 - end-to-end MoE decode_agg, clean BACK-TO-BACK A/B (build-cuda-base @128 vs build-cuda @64)
+`llama-batched-bench -m q36-35b-a3b-nvfp4 -c 32768 -ngl 99 -fa on -npp 128 -ntg 128 -npl 32,128`, S_TG t/s,
+3 reps alternating (no concurrent load):
+
+| npl | mmq_y=128 (base) mean | mmq_y=64 (B-3) mean | delta |
+|----:|----------------------:|--------------------:|------:|
+| 32  | 437.6 (437.3-437.7)   | 438.8 (438.4-439.1) | +0.29% |
+| 128 | 750.1 (748.9-751.1)   | 753.1 (753.0-753.4) | +0.40% |
+
+Every B-3 rep edges the base by +0.3-0.4% @npl128 (consistent, kernel-backed), but the per-build spread
+(base 748.9-751.1) OVERLAPS - it is at the edge of noise, NOT a meaningful lift. Caps the end-to-end win
+at well under 1%, nowhere near the gap to vLLM (882).
+
+## MEASUREMENT 2 - nsys kernel-level A/B (the meaningful B-3 evidence), clean, no concurrent load
+`GGML_CUDA_DISABLE_GRAPHS=1 nsys ... -npp 4 -ntg 32 -npl 128`, decode-isolated window, `cuda_gpu_kern_sum`:
+
+| kernel (% of window)            | mmq_y=128 total ns | mmq_y=64 total ns | delta  |
+|---------------------------------|-------------------:|------------------:|-------:|
+| gated_delta_net (SSM, ~40%)     | 2,335,951,709      | 2,334,847,390     | 0.0% (untouched, DOMINANT) |
+| **mul_mat_q<NVFP4, M-tile 64>** (MoE GEMM, ~26%) | **1,502,548,958** | **1,483,685,630** | **-1.26% (kernel faster)** |
+| mul_mat_q<NVFP4, M-tile 128> (router, ~3.7%)     | 224,532,704       | 210,885,920       | -6.1%  |
+| quantize_mmq_nvfp4 (act-quant, ~2%)              | 119,118,624       | 118,718,496       | -0.3%  |
+| **mul_mat_q_stream_k_fixup<128>** (~0.6%)        | **26,848,479**    | **38,117,532**    | **+42% (fixup COSTLIER)** |
+
+The warp-remap DOES what it claims at the kernel level: the grouped GEMM is **-1.3%** (more resident
+CTAs hide a sliver of weight-load latency). But (a) it is only ~26% of the step, (b) halving mmq_y
+DOUBLES the row-tiles so the stream-k fixup recombination grows **+42%** (+11.3M ns), eating ~60% of the
+GEMM's 18.9M-ns saving, and (c) the step is dominated by the gated_delta_net SSM (~40%, untouched, and
+already PAST vLLM's BW efficiency per 0018-0022) with the GEMM itself at the LPDDR5x BW floor. Net
+mul_mat region saving ~7.6M ns on a ~5.8B-ns window = ~0.13%; end-to-end +0.3-0.4% (within noise).
+**This is the definitive BW-bound proof: even a real occupancy win on the target kernel does not move
+end-to-end** - the same outcome as patch 0015 (M-tile NEUTRAL), 0017 (MINBLOCKS +8.7% slower), and B-2
+(act-quant FLAT). The MoE grouped GEMM is bandwidth-limited, not occupancy-limited, at the kernel exit.
+
+## DECISION
+No patch 0028 (B-3 does not lift end-to-end; bit-exactness is empirical, not by-construction; the fixup
+penalty + BW floor swamp the +1.3% kernel win). Dev tree reverted to pristine 0025 (no ggml diff),
+build-cuda reconfigured to default (no flag) and rebuilt. The `mmq_get_nwarps_device<type>()` remap is a
+correct, reusable warp-remap should occupancy-bound FP4 hardware ever appear; it is inert on GB10.
+
+---
+
+# FINAL ASSESSMENT - the honest bit-exact MoE ceiling, and the recommendation
+
+## The bit-exact MoE GEMM/launch track is now EXHAUSTED
+| MoE lever (bit-exact) | result | MoE decode_agg @npl128 |
+|-----------------------|--------|------------------------|
+| 0025 re-graph (B-1, LANDED) | the ONLY bit-exact MoE win | ~82% -> **~85%** of vLLM |
+| B-2 act-quant retune (no patch) | FLAT (128 already optimal) | +0% |
+| B-3 mmq_y-down warp-remap (no patch) | FLAT (kernel -1.3%, e2e +0.3% noise) | +0% |
+
+**Honest bit-exact MoE ceiling on GB10 = ~85% of vLLM @npl128 (753 / 882.2), ~87.5% @npl32 (439 / 500.8).**
+B-1 (re-graph, in 0025) banked the move from ~82% to ~85%; B-2 and B-3 each add 0. The grouped-GEMM/
+launch track has no remaining bit-exact headroom.
+
+## Is the residual the structural Marlin-MoE gap? YES.
+The remaining ~15% is structural and uncloseable bit-exactly, decomposed from the nsys:
+- **Grouped FP4 GEMM (~26%) is at the LPDDR5x BW floor.** B-3 proved an occupancy win there is
+  end-to-end-inert. vLLM ships a purpose-built **Marlin-NvFp4** grouped GEMM (a different, more
+  bandwidth-efficient schedule); llama runs native FP4-MMA W4A4 (a HIGHER arithmetic tier, but the
+  decode shape is BW-bound so the tier does not help). This is THE structural gap and matches
+  FP4_GEMM_SCOPE_B.md's "MoE ceiling ~76% from the GEMM track alone."
+- **The SSM recurrence (~40%) is already PAST vLLM** (84.6% vs 82.4% peak BW, 0018-0022) - not a lever.
+- **bf16 projections (~10.5%)** - both engines pay similar; not a bit-exact lever.
+
+No bit-exact lever closes the structural grouped-GEMM gap. ~85% is the honest bit-exact MoE plateau.
+
+## RECOMMENDATION: ship the bit-exact ~85% as DEFAULT; expose 0026 bf16-SSM as a documented opt-in for the last ~10% on MoE (NOT default, NOT in the recommended config)
+
+Per the user's decision rule ("pursue B first; if it cannot reach/beat vLLM on MoE, fall back to the
+held hybrid/bf16 opt-in"): **B (bit-exact) cannot reach vLLM on MoE (~85%), so the fallback applies -
+but with a hard caveat the team must carry.**
+
+1. **DEFAULT = the bit-exact plateau (0025 with the re-graph), MoE ~85% of vLLM.** This is the honest,
+   precision-safe ship: the recurrence already BEATS vLLM's BW efficiency, the GEMM is the same FP4
+   arithmetic class, and the output is byte-identical to the f32 reference. Do not claim MoE *parity*
+   bit-exactly - claim ~85% with a precision profile at-or-above vLLM.
+
+2. **FALLBACK (opt-in only) = 0026 hybrid bf16-SSM.** It is the ONLY remaining MoE lever (it speeds the
+   ~40% recurrence, the part B does not touch): measured **+11.5% MoE decode** (1110.7 -> 1238.1 t/s in
+   the 0026 harness) -> would lift MoE ~85% -> **~95% of vLLM**. BUT: (a) it is **non-bit-exact**; (b) it
+   **FAILS the MoE KL ship-gate by a wide margin** (MeanKLD ~0.045 / Same-top-p ~91% vs the 1e-3 / 99.5%
+   bar - the gated-DeltaNet state is hypersensitive to bf16; A_HYBRID_SSM_RESULTS.md: "MoE has NO low-KL
+   regime ... Do NOT put a hybrid T in the gallery/recommended config"); and (c) even then it reaches
+   **~95%, not a clean beat** of vLLM, while conceding precision vLLM keeps (all-f32 SSM state).
+
+   => Ship 0026 default-OFF (`ssm_hybrid_tau_thresh = 0` / no `--ssm-bf16-tau`); expose the bf16-SSM as
+   an EXPLICIT opt-in flag for callers who knowingly accept a real MoE precision regression for ~+11.5%
+   decode (~95% of vLLM). Keep it OUT of the gallery/recommended MoE config.
+
+**Bottom line for the parent:** bit-exact MoE on GB10 plateaus at **~85% of vLLM** and the residual is
+the structural Marlin-NvFp4 grouped-GEMM gap that NO bit-exact lever closes (B-1 banked the re-graph;
+B-2 and B-3 are 0). Bit-exact does NOT reach/beat vLLM on MoE. The only lever that closes more (to ~95%)
+is the held 0026 bf16-SSM, which is **non-bit-exact AND fails the MoE KL gate** - so it ships **opt-in,
+default-off, not in the recommended config**, not as the default. Recommend shipping the honest ~85%
+bit-exact default and documenting the opt-in for users who accept the precision tradeoff. Do not market
+MoE parity; the bit-exact default is ~85% with a precision profile at-or-above vLLM, which is the
+defensible claim.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]