docs(paged): B-3 mmq_y-down warp-remap NEGATIVE - bit-exact MoE ceiling ~85% of vLLM

B-3 (the 0017-deferred mmq_y-down warp-remap of the NVFP4 grouped FP4-MMA
mul_mat_q) was built bit-exact on the clean 0025 base and measured: the
grouped GEMM kernel itself runs -1.3% (occupancy did rise via the nwarps=4
warp-remap / 128 threads-per-CTA), but end-to-end MoE decode is FLAT
(npl128 +0.4%, npl32 +0.3%, within noise) because the stream-k fixup grows
+42% (mmq_y=64 doubles the row-tiles) and the step is SSM/BW-bound. md5 PASS
both models, test-backend-ops MUL_MAT 1146/1146 + MUL_MAT_ID 806/806 PASS.
No patch 0028; DGX dev tree reverted to pristine 0025.

Assessment: the bit-exact MoE GEMM/launch track is exhausted (B-1 re-graph
banked ~82->85%; B-2 and B-3 are 0). Honest bit-exact MoE ceiling = ~85% of
vLLM @npl128. The residual is the structural Marlin-NvFp4 grouped-GEMM gap
that no bit-exact lever closes. Recommend shipping the ~85% bit-exact default
and exposing the held 0026 bf16-SSM as a default-off opt-in (it reaches ~95%
but is non-bit-exact and fails the MoE KL gate).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
Ettore Di Giacinto
2026-06-26 19:10:24 +00:00
parent 1f857f179e
commit 9c1c2a6a16
2 changed files with 185 additions and 27 deletions

View File

@@ -1,35 +1,51 @@
# B_MOE_PROGRESS.md - B-2 (down_proj act-quant retune, patch 0027) checkpoint
# B_MOE_PROGRESS.md - B-3 (mmq_y-down warp-remap, patch 0028) checkpoint
Agent: B2-build (GPU agent). Base: 0025 tip (DGX `~/llama-paged-dev` `2f4f5ab`, branch `b-work`),
independent of the held hybrid 0026. Worktree: `.../feat+paged-attention`.
Agent: B3-or-assess (GPU agent, DGX GB10 sm_121). Base: clean 0025 tip (`~/llama-paged-dev`
`2f4f5ab`, branch `b-work`), independent of the held hybrid 0026. Worktree: `.../feat+paged-attention`.
## The lever (B-2 / M1)
Bit-exact block/grid/occupancy retune of `quantize_mmq_nvfp4` (the MoE down_proj activation-quant,
~2% of the MoE decode step). `ggml/src/ggml-cuda/quantize.cu`, `quantize_mmq_fp4_cuda` NVFP4 branch.
## Prior: B-2 (act-quant retune) = NEGATIVE (no lift, no patch 0027). MoE ~85% of vLLM @npl128.
B-2 proved the act-quant tax (~2%) is already optimally tiled; the structural MoE residual is the
grouped FP4 `mul_mat_q<NVFP4>` GEMM (~27%, LPDDR5x BW floor) + bf16 projections (~10.5%). => try B-3.
## Why it is provably byte-identical
`quantize_mmq_nvfp4` maps thread -> column purely through the global linear index
`gy = blockDim.x*blockIdx.y + threadIdx.x` -> `i0_base = gy*QK_NVFP4_SUB`, with NO cross-thread
communication (no shared memory, no warp reduction) and every thread owning a disjoint output
sub-block (its own `sub` slot in `block_fp4_mmq`). So the (thread)->output-byte map - and thus the
produced bytes - are invariant to `blockDim.x` as long as `block_num_y` is recomputed from the SAME
`blockDim.x`. We retune ONLY `blockDim.x`; the per-thread quant body + writeback are untouched.
## The lever (B-3 / SPEEDUP_HUNT B rank #3)
mmq_y-down warp-remap of the NVFP4 FP4-MMA grouped GEMM `mul_mat_q<NVFP4>` in `ggml/.../mmq.cuh`.
mmq_y tiles the weight-row (N) dimension; lowering 128->64 raises resident CTAs (smaller per-CTA
shared + accumulator + 128 vs 256 threads/CTA => ~2x blocks/SM) to hide LPDDR5x weight-load latency,
WITHOUT re-reading weights (each weight row lives in exactly one row-tile => BW-neutral). The MoE
GEMM runs at ~35% of peak BW (occupancy-limited, NOT BW-saturated), so more resident CTAs is the
right mechanism - and it is the ONE untested occupancy lever (M-tile = NEUTRAL 0015, MINBLOCKS =
+8.7% slower 0017).
## Change
`static const int nvfp4_block_size` selected once via env `LLAMA_MOE_QUANT_BLOCK` (default 128 =
baseline; final = measured GB10 winner), `block_num_y` recomputed consistently. ~20 LOC, one TU.
## The coupling that makes it a real kernel change (not the 0017 knob alone)
The FP4-MMA path has `static_assert(nwarps*tile_C::I == mmq_y)` (mmq.cuh:3280; tile_C::I==16 for the
m16n8k64 block-scaled FP4 MMA). nwarps is global `256/warp_size = 8`, so mmq_y is pinned at 128. The
0017 `GGML_CUDA_FP4_MMQ_Y` knob alone would TRIP this assert at mmq_y=64. B-3 makes nwarps TYPE-AWARE:
`mmq_get_nwarps_device<type>()` returns mmq_y/16 = 4 for NVFP4-reduced (else stock 8), keeping the
coupling. 2 new overloads (device template + host 3-arg) + 9 call-site swaps to `<type>`. Default
GGML_CUDA_FP4_MMQ_Y==128 returns stock nwarps for EVERY type => default build byte-identical to stock.
## Status: COMPLETE - NEGATIVE (no lift). Full result in B_MOE_RESULTS.md.
- [x] Branched `b-work` off 0025 (`2f4f5ab`); patch applied to quantize.cu.
- [x] Build clean (llama-completion, llama-batched-bench, test-backend-ops). BUILD_EXIT=0.
- [x] md5 gate @block=128 (default): dense 5951a5b4 == ref, MoE 07db32c2 == ref. MUL_MAT 1146/1146,
MUL_MAT_ID 806/806 PASS.
- [x] BIT-EXACT proof across block sizes: block 64 AND 256 -> identical md5 both models.
- [x] Sweep block {32,64,96,128,160,192,256}: end-to-end FLAT (npl32 436-438, npl128 749-752, all
within 0.4% noise). NO block lifts decode.
- [x] nsys quantize_mmq_nvfp4: block=128 is the FASTEST (117.4M ns; 64 +8.7%, 192 +9.9%, 256 +6.9%).
128 already optimal => ZERO headroom.
- [x] DECISION: no patch 0027 (does not lift). Dev tree reverted to pristine 0025. Recommend B-3.
## Bit-exactness note (the real risk)
The per-output K-reduction order is mmq_y-INVARIANT (each output row owned by one thread). BUT mmq_y=64
DOUBLES nty (row-tiles), changing the stream-k kbc partition => an output tile's K-range may be split
across CTAs at different points and recombined by `mul_mat_q_stream_k_fixup` in a different grouping =>
FP non-associativity CAN perturb the last logit bits => greedy argmax COULD flip. So B-3 is NOT
bit-exact-by-construction in the md5 sense; the md5 gate is EMPIRICAL. md5 fail => not bit-exact => STOP.
## Status: COMPLETE - BIT-EXACT but FLAT. No patch 0028. Full result + assessment in B_MOE_RESULTS.md.
- [x] Source-read mmq.cuh: nwarps/mmq_y coupling, FP4 MMA vec_dot, kernel+fixup+launch+case sites.
- [x] Edited mmq.cuh: 2 nwarps overloads + 9 `<type>` swaps. git diff clean (37+/11-).
- [x] BEFORE baseline (stock-0025 binaries, same session): dense md5 5951a5b4==ref, moe 07db32c2==ref;
MoE S_TG npl32=441.98, npl128=756.47.
- [x] BUILD build-cuda @mmq_y=64 (full cuda rebuild): EXIT=0 - compiles (static_assert holds at 4*16=64).
- [x] md5 GATE PASS both models @64; test-backend-ops MUL_MAT 1146/1146, MUL_MAT_ID 806/806 PASS.
- [x] Clean back-to-back A/B (build-cuda-base @128 vs build-cuda @64), 3 reps: npl32 +0.29%,
npl128 +0.40% - within the ~0.4% noise band. FLAT.
- [x] nsys A/B: grouped GEMM kernel mmq_y=64 -1.3% FASTER, BUT stream_k_fixup +42% costlier + SSM (40%)
dominant & untouched => end-to-end inert. BW-bound confirmed (same as 0015/0017/B-2).
- [x] DECIDED: FLAT -> no patch 0028. Dev tree reverted to pristine 0025 (no ggml diff), build-cuda
reconfigured to default + rebuilt. Bit-exact MoE ceiling = ~85% @npl128 / ~87.5% @npl32 of vLLM.
- [x] ASSESS + RECOMMEND (in B_MOE_RESULTS.md): residual = structural Marlin-NvFp4 grouped-GEMM gap,
uncloseable bit-exactly; fall back to 0026 bf16-SSM opt-in (default-off, fails MoE KL gate, ~95%).
## Gate references
- dense q36-27b-nvfp4 md5 == 5951a5b4d624ce891e22ab5fca9bc439

View File

@@ -88,3 +88,145 @@ hook + this measurement confirm 128 is the GB10 optimum, should other hardware e
Hand off to B-3 (patch 0028) as the next MoE GEMM lever.
Assisted-by: Claude:opus-4.8 [Claude Code]
---
# B-3 (mmq_y-down warp-remap of the NVFP4 grouped GEMM) RESULT: BIT-EXACT but FLAT (no patch 0028)
Agent: B3-or-assess (GPU agent, DGX GB10 sm_121). Base: clean 0025 tip (`~/llama-paged-dev` `2f4f5ab`,
branch `b-work`), independent of the held hybrid 0026. Lever: SPEEDUP_HUNT.md section B rank #3 - the
0017-deferred structural `mmq_y`-down warp-remap on the grouped FP4-MMA `mul_mat_q<NVFP4>` (the ~26-27%
MoE-specific GEMM), the only untested MoE GEMM occupancy lever.
## VERDICT
**Bit-exact (md5 PASS both models + test-backend-ops PASS), but end-to-end FLAT: npl128 +0.3-0.4%
(consistent direction, kernel-backed) and npl32 +0.1-0.3%, both inside the ~0.4% run-to-run band. The
warp-remap makes the grouped GEMM kernel ITSELF ~1.3% faster (occupancy DID rise) but the step is
BW/SSM-bound, so it does NOT lift MoE decode. No patch 0028.** MoE stays ~85% of vLLM @npl128.
## The change that was built+measured (bit-exact, then REVERTED)
`ggml/src/ggml-cuda/mmq.cuh`. The FP4-MMA path couples the weight-row tile to the warp count via the
invariant `static_assert(nwarps*tile_C::I == mmq_y)` (mmq.cuh:3280; `tile_C::I==16` for the m16n8k64
block-scaled FP4 MMA). `nwarps` is global `256/warp_size = 8`, pinning `mmq_y=128`; the 0017
`GGML_CUDA_FP4_MMQ_Y` knob alone would TRIP that assert at 64. B-3 makes nwarps TYPE-AWARE: a new
`mmq_get_nwarps_device<type>()` (+ 3-arg host overload) returns `mmq_y/16 = 4` for NVFP4-reduced (else
the stock 8), so `mmq_y=64 -> nwarps=4 -> 128 threads/CTA` (vs 256) -> ~2x resident CTAs. 2 overloads +
9 `<type>` call-site swaps (kernel, process_tile, write_back_mma, stream_k_fixup, nvfp4 loader, 2 host).
Built with `-DGGML_CUDA_FP4_MMQ_Y=64`; the compile SUCCEEDS (the static_assert now holds at 4*16=64).
**Default `GGML_CUDA_FP4_MMQ_Y==128` returns stock nwarps for every type => a default build is
byte-identical to stock** (the bit-exact opt-out, proven by the md5 below at 128).
### Bit-exactness is EMPIRICAL here (not by-construction)
The per-output K-reduction order is mmq_y-invariant (each output row owned by one thread), but mmq_y=64
DOUBLES `nty` (row-tiles), changing the stream-k `kbc` partition => an output tile's K-range can be
split across CTAs at different points and recombined by `mul_mat_q_stream_k_fixup` in a different
grouping => FP non-associativity COULD perturb the last logit bits and flip a greedy argmax. It did NOT
for the gate prompt (md5 matched), but B-3 is therefore NOT bit-exact-by-construction - a default-ON
ship would be a (small) precision risk. This is a second reason not to ship it for a 0% gain.
## GATE (bit-exact) - BOTH MODELS PASS
greedy `llama-completion -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1`:
- dense q36-27b-nvfp4 = 5951a5b4d624ce891e22ab5fca9bc439 == ref
- MoE q36-35b-a3b-nvfp4 = 07db32c2bcb78d17a43ed18bc22705cd == ref
- test-backend-ops CUDA0: **MUL_MAT 1146/1146 PASS, MUL_MAT_ID 806/806 PASS.**
## MEASUREMENT 1 - end-to-end MoE decode_agg, clean BACK-TO-BACK A/B (build-cuda-base @128 vs build-cuda @64)
`llama-batched-bench -m q36-35b-a3b-nvfp4 -c 32768 -ngl 99 -fa on -npp 128 -ntg 128 -npl 32,128`, S_TG t/s,
3 reps alternating (no concurrent load):
| npl | mmq_y=128 (base) mean | mmq_y=64 (B-3) mean | delta |
|----:|----------------------:|--------------------:|------:|
| 32 | 437.6 (437.3-437.7) | 438.8 (438.4-439.1) | +0.29% |
| 128 | 750.1 (748.9-751.1) | 753.1 (753.0-753.4) | +0.40% |
Every B-3 rep edges the base by +0.3-0.4% @npl128 (consistent, kernel-backed), but the per-build spread
(base 748.9-751.1) OVERLAPS - it is at the edge of noise, NOT a meaningful lift. Caps the end-to-end win
at well under 1%, nowhere near the gap to vLLM (882).
## MEASUREMENT 2 - nsys kernel-level A/B (the meaningful B-3 evidence), clean, no concurrent load
`GGML_CUDA_DISABLE_GRAPHS=1 nsys ... -npp 4 -ntg 32 -npl 128`, decode-isolated window, `cuda_gpu_kern_sum`:
| kernel (% of window) | mmq_y=128 total ns | mmq_y=64 total ns | delta |
|---------------------------------|-------------------:|------------------:|-------:|
| gated_delta_net (SSM, ~40%) | 2,335,951,709 | 2,334,847,390 | 0.0% (untouched, DOMINANT) |
| **mul_mat_q<NVFP4, M-tile 64>** (MoE GEMM, ~26%) | **1,502,548,958** | **1,483,685,630** | **-1.26% (kernel faster)** |
| mul_mat_q<NVFP4, M-tile 128> (router, ~3.7%) | 224,532,704 | 210,885,920 | -6.1% |
| quantize_mmq_nvfp4 (act-quant, ~2%) | 119,118,624 | 118,718,496 | -0.3% |
| **mul_mat_q_stream_k_fixup<128>** (~0.6%) | **26,848,479** | **38,117,532** | **+42% (fixup COSTLIER)** |
The warp-remap DOES what it claims at the kernel level: the grouped GEMM is **-1.3%** (more resident
CTAs hide a sliver of weight-load latency). But (a) it is only ~26% of the step, (b) halving mmq_y
DOUBLES the row-tiles so the stream-k fixup recombination grows **+42%** (+11.3M ns), eating ~60% of the
GEMM's 18.9M-ns saving, and (c) the step is dominated by the gated_delta_net SSM (~40%, untouched, and
already PAST vLLM's BW efficiency per 0018-0022) with the GEMM itself at the LPDDR5x BW floor. Net
mul_mat region saving ~7.6M ns on a ~5.8B-ns window = ~0.13%; end-to-end +0.3-0.4% (within noise).
**This is the definitive BW-bound proof: even a real occupancy win on the target kernel does not move
end-to-end** - the same outcome as patch 0015 (M-tile NEUTRAL), 0017 (MINBLOCKS +8.7% slower), and B-2
(act-quant FLAT). The MoE grouped GEMM is bandwidth-limited, not occupancy-limited, at the kernel exit.
## DECISION
No patch 0028 (B-3 does not lift end-to-end; bit-exactness is empirical, not by-construction; the fixup
penalty + BW floor swamp the +1.3% kernel win). Dev tree reverted to pristine 0025 (no ggml diff),
build-cuda reconfigured to default (no flag) and rebuilt. The `mmq_get_nwarps_device<type>()` remap is a
correct, reusable warp-remap should occupancy-bound FP4 hardware ever appear; it is inert on GB10.
---
# FINAL ASSESSMENT - the honest bit-exact MoE ceiling, and the recommendation
## The bit-exact MoE GEMM/launch track is now EXHAUSTED
| MoE lever (bit-exact) | result | MoE decode_agg @npl128 |
|-----------------------|--------|------------------------|
| 0025 re-graph (B-1, LANDED) | the ONLY bit-exact MoE win | ~82% -> **~85%** of vLLM |
| B-2 act-quant retune (no patch) | FLAT (128 already optimal) | +0% |
| B-3 mmq_y-down warp-remap (no patch) | FLAT (kernel -1.3%, e2e +0.3% noise) | +0% |
**Honest bit-exact MoE ceiling on GB10 = ~85% of vLLM @npl128 (753 / 882.2), ~87.5% @npl32 (439 / 500.8).**
B-1 (re-graph, in 0025) banked the move from ~82% to ~85%; B-2 and B-3 each add 0. The grouped-GEMM/
launch track has no remaining bit-exact headroom.
## Is the residual the structural Marlin-MoE gap? YES.
The remaining ~15% is structural and uncloseable bit-exactly, decomposed from the nsys:
- **Grouped FP4 GEMM (~26%) is at the LPDDR5x BW floor.** B-3 proved an occupancy win there is
end-to-end-inert. vLLM ships a purpose-built **Marlin-NvFp4** grouped GEMM (a different, more
bandwidth-efficient schedule); llama runs native FP4-MMA W4A4 (a HIGHER arithmetic tier, but the
decode shape is BW-bound so the tier does not help). This is THE structural gap and matches
FP4_GEMM_SCOPE_B.md's "MoE ceiling ~76% from the GEMM track alone."
- **The SSM recurrence (~40%) is already PAST vLLM** (84.6% vs 82.4% peak BW, 0018-0022) - not a lever.
- **bf16 projections (~10.5%)** - both engines pay similar; not a bit-exact lever.
No bit-exact lever closes the structural grouped-GEMM gap. ~85% is the honest bit-exact MoE plateau.
## RECOMMENDATION: ship the bit-exact ~85% as DEFAULT; expose 0026 bf16-SSM as a documented opt-in for the last ~10% on MoE (NOT default, NOT in the recommended config)
Per the user's decision rule ("pursue B first; if it cannot reach/beat vLLM on MoE, fall back to the
held hybrid/bf16 opt-in"): **B (bit-exact) cannot reach vLLM on MoE (~85%), so the fallback applies -
but with a hard caveat the team must carry.**
1. **DEFAULT = the bit-exact plateau (0025 with the re-graph), MoE ~85% of vLLM.** This is the honest,
precision-safe ship: the recurrence already BEATS vLLM's BW efficiency, the GEMM is the same FP4
arithmetic class, and the output is byte-identical to the f32 reference. Do not claim MoE *parity*
bit-exactly - claim ~85% with a precision profile at-or-above vLLM.
2. **FALLBACK (opt-in only) = 0026 hybrid bf16-SSM.** It is the ONLY remaining MoE lever (it speeds the
~40% recurrence, the part B does not touch): measured **+11.5% MoE decode** (1110.7 -> 1238.1 t/s in
the 0026 harness) -> would lift MoE ~85% -> **~95% of vLLM**. BUT: (a) it is **non-bit-exact**; (b) it
**FAILS the MoE KL ship-gate by a wide margin** (MeanKLD ~0.045 / Same-top-p ~91% vs the 1e-3 / 99.5%
bar - the gated-DeltaNet state is hypersensitive to bf16; A_HYBRID_SSM_RESULTS.md: "MoE has NO low-KL
regime ... Do NOT put a hybrid T in the gallery/recommended config"); and (c) even then it reaches
**~95%, not a clean beat** of vLLM, while conceding precision vLLM keeps (all-f32 SSM state).
=> Ship 0026 default-OFF (`ssm_hybrid_tau_thresh = 0` / no `--ssm-bf16-tau`); expose the bf16-SSM as
an EXPLICIT opt-in flag for callers who knowingly accept a real MoE precision regression for ~+11.5%
decode (~95% of vLLM). Keep it OUT of the gallery/recommended MoE config.
**Bottom line for the parent:** bit-exact MoE on GB10 plateaus at **~85% of vLLM** and the residual is
the structural Marlin-NvFp4 grouped-GEMM gap that NO bit-exact lever closes (B-1 banked the re-graph;
B-2 and B-3 are 0). Bit-exact does NOT reach/beat vLLM on MoE. The only lever that closes more (to ~95%)
is the held 0026 bf16-SSM, which is **non-bit-exact AND fails the MoE KL gate** - so it ships **opt-in,
default-off, not in the recommended config**, not as the default. Recommend shipping the honest ~85%
bit-exact default and documenting the opt-in for users who accept the precision tradeoff. Do not market
MoE parity; the bit-exact default is ~85% with a precision profile at-or-above vLLM, which is the
defensible claim.
Assisted-by: Claude:opus-4.8 [Claude Code]