diff --git a/backend/cpp/llama-cpp/patches/paged/B_MOE_PROGRESS.md b/backend/cpp/llama-cpp/patches/paged/B_MOE_PROGRESS.md index 5874c5eb2..4a8beba10 100644 --- a/backend/cpp/llama-cpp/patches/paged/B_MOE_PROGRESS.md +++ b/backend/cpp/llama-cpp/patches/paged/B_MOE_PROGRESS.md @@ -1,35 +1,51 @@ -# B_MOE_PROGRESS.md - B-2 (down_proj act-quant retune, patch 0027) checkpoint +# B_MOE_PROGRESS.md - B-3 (mmq_y-down warp-remap, patch 0028) checkpoint -Agent: B2-build (GPU agent). Base: 0025 tip (DGX `~/llama-paged-dev` `2f4f5ab`, branch `b-work`), -independent of the held hybrid 0026. Worktree: `.../feat+paged-attention`. +Agent: B3-or-assess (GPU agent, DGX GB10 sm_121). Base: clean 0025 tip (`~/llama-paged-dev` +`2f4f5ab`, branch `b-work`), independent of the held hybrid 0026. Worktree: `.../feat+paged-attention`. -## The lever (B-2 / M1) -Bit-exact block/grid/occupancy retune of `quantize_mmq_nvfp4` (the MoE down_proj activation-quant, -~2% of the MoE decode step). `ggml/src/ggml-cuda/quantize.cu`, `quantize_mmq_fp4_cuda` NVFP4 branch. +## Prior: B-2 (act-quant retune) = NEGATIVE (no lift, no patch 0027). MoE ~85% of vLLM @npl128. +B-2 proved the act-quant tax (~2%) is already optimally tiled; the structural MoE residual is the +grouped FP4 `mul_mat_q` GEMM (~27%, LPDDR5x BW floor) + bf16 projections (~10.5%). => try B-3. -## Why it is provably byte-identical -`quantize_mmq_nvfp4` maps thread -> column purely through the global linear index -`gy = blockDim.x*blockIdx.y + threadIdx.x` -> `i0_base = gy*QK_NVFP4_SUB`, with NO cross-thread -communication (no shared memory, no warp reduction) and every thread owning a disjoint output -sub-block (its own `sub` slot in `block_fp4_mmq`). So the (thread)->output-byte map - and thus the -produced bytes - are invariant to `blockDim.x` as long as `block_num_y` is recomputed from the SAME -`blockDim.x`. We retune ONLY `blockDim.x`; the per-thread quant body + writeback are untouched. +## The lever (B-3 / SPEEDUP_HUNT B rank #3) +mmq_y-down warp-remap of the NVFP4 FP4-MMA grouped GEMM `mul_mat_q` in `ggml/.../mmq.cuh`. +mmq_y tiles the weight-row (N) dimension; lowering 128->64 raises resident CTAs (smaller per-CTA +shared + accumulator + 128 vs 256 threads/CTA => ~2x blocks/SM) to hide LPDDR5x weight-load latency, +WITHOUT re-reading weights (each weight row lives in exactly one row-tile => BW-neutral). The MoE +GEMM runs at ~35% of peak BW (occupancy-limited, NOT BW-saturated), so more resident CTAs is the +right mechanism - and it is the ONE untested occupancy lever (M-tile = NEUTRAL 0015, MINBLOCKS = ++8.7% slower 0017). -## Change -`static const int nvfp4_block_size` selected once via env `LLAMA_MOE_QUANT_BLOCK` (default 128 = -baseline; final = measured GB10 winner), `block_num_y` recomputed consistently. ~20 LOC, one TU. +## The coupling that makes it a real kernel change (not the 0017 knob alone) +The FP4-MMA path has `static_assert(nwarps*tile_C::I == mmq_y)` (mmq.cuh:3280; tile_C::I==16 for the +m16n8k64 block-scaled FP4 MMA). nwarps is global `256/warp_size = 8`, so mmq_y is pinned at 128. The +0017 `GGML_CUDA_FP4_MMQ_Y` knob alone would TRIP this assert at mmq_y=64. B-3 makes nwarps TYPE-AWARE: +`mmq_get_nwarps_device()` returns mmq_y/16 = 4 for NVFP4-reduced (else stock 8), keeping the +coupling. 2 new overloads (device template + host 3-arg) + 9 call-site swaps to ``. Default +GGML_CUDA_FP4_MMQ_Y==128 returns stock nwarps for EVERY type => default build byte-identical to stock. -## Status: COMPLETE - NEGATIVE (no lift). Full result in B_MOE_RESULTS.md. -- [x] Branched `b-work` off 0025 (`2f4f5ab`); patch applied to quantize.cu. -- [x] Build clean (llama-completion, llama-batched-bench, test-backend-ops). BUILD_EXIT=0. -- [x] md5 gate @block=128 (default): dense 5951a5b4 == ref, MoE 07db32c2 == ref. MUL_MAT 1146/1146, - MUL_MAT_ID 806/806 PASS. -- [x] BIT-EXACT proof across block sizes: block 64 AND 256 -> identical md5 both models. -- [x] Sweep block {32,64,96,128,160,192,256}: end-to-end FLAT (npl32 436-438, npl128 749-752, all - within 0.4% noise). NO block lifts decode. -- [x] nsys quantize_mmq_nvfp4: block=128 is the FASTEST (117.4M ns; 64 +8.7%, 192 +9.9%, 256 +6.9%). - 128 already optimal => ZERO headroom. -- [x] DECISION: no patch 0027 (does not lift). Dev tree reverted to pristine 0025. Recommend B-3. +## Bit-exactness note (the real risk) +The per-output K-reduction order is mmq_y-INVARIANT (each output row owned by one thread). BUT mmq_y=64 +DOUBLES nty (row-tiles), changing the stream-k kbc partition => an output tile's K-range may be split +across CTAs at different points and recombined by `mul_mat_q_stream_k_fixup` in a different grouping => +FP non-associativity CAN perturb the last logit bits => greedy argmax COULD flip. So B-3 is NOT +bit-exact-by-construction in the md5 sense; the md5 gate is EMPIRICAL. md5 fail => not bit-exact => STOP. + +## Status: COMPLETE - BIT-EXACT but FLAT. No patch 0028. Full result + assessment in B_MOE_RESULTS.md. +- [x] Source-read mmq.cuh: nwarps/mmq_y coupling, FP4 MMA vec_dot, kernel+fixup+launch+case sites. +- [x] Edited mmq.cuh: 2 nwarps overloads + 9 `` swaps. git diff clean (37+/11-). +- [x] BEFORE baseline (stock-0025 binaries, same session): dense md5 5951a5b4==ref, moe 07db32c2==ref; + MoE S_TG npl32=441.98, npl128=756.47. +- [x] BUILD build-cuda @mmq_y=64 (full cuda rebuild): EXIT=0 - compiles (static_assert holds at 4*16=64). +- [x] md5 GATE PASS both models @64; test-backend-ops MUL_MAT 1146/1146, MUL_MAT_ID 806/806 PASS. +- [x] Clean back-to-back A/B (build-cuda-base @128 vs build-cuda @64), 3 reps: npl32 +0.29%, + npl128 +0.40% - within the ~0.4% noise band. FLAT. +- [x] nsys A/B: grouped GEMM kernel mmq_y=64 -1.3% FASTER, BUT stream_k_fixup +42% costlier + SSM (40%) + dominant & untouched => end-to-end inert. BW-bound confirmed (same as 0015/0017/B-2). +- [x] DECIDED: FLAT -> no patch 0028. Dev tree reverted to pristine 0025 (no ggml diff), build-cuda + reconfigured to default + rebuilt. Bit-exact MoE ceiling = ~85% @npl128 / ~87.5% @npl32 of vLLM. +- [x] ASSESS + RECOMMEND (in B_MOE_RESULTS.md): residual = structural Marlin-NvFp4 grouped-GEMM gap, + uncloseable bit-exactly; fall back to 0026 bf16-SSM opt-in (default-off, fails MoE KL gate, ~95%). ## Gate references - dense q36-27b-nvfp4 md5 == 5951a5b4d624ce891e22ab5fca9bc439 diff --git a/backend/cpp/llama-cpp/patches/paged/B_MOE_RESULTS.md b/backend/cpp/llama-cpp/patches/paged/B_MOE_RESULTS.md index 7aa79af96..5929939df 100644 --- a/backend/cpp/llama-cpp/patches/paged/B_MOE_RESULTS.md +++ b/backend/cpp/llama-cpp/patches/paged/B_MOE_RESULTS.md @@ -88,3 +88,145 @@ hook + this measurement confirm 128 is the GB10 optimum, should other hardware e Hand off to B-3 (patch 0028) as the next MoE GEMM lever. Assisted-by: Claude:opus-4.8 [Claude Code] + +--- + +# B-3 (mmq_y-down warp-remap of the NVFP4 grouped GEMM) RESULT: BIT-EXACT but FLAT (no patch 0028) + +Agent: B3-or-assess (GPU agent, DGX GB10 sm_121). Base: clean 0025 tip (`~/llama-paged-dev` `2f4f5ab`, +branch `b-work`), independent of the held hybrid 0026. Lever: SPEEDUP_HUNT.md section B rank #3 - the +0017-deferred structural `mmq_y`-down warp-remap on the grouped FP4-MMA `mul_mat_q` (the ~26-27% +MoE-specific GEMM), the only untested MoE GEMM occupancy lever. + +## VERDICT +**Bit-exact (md5 PASS both models + test-backend-ops PASS), but end-to-end FLAT: npl128 +0.3-0.4% +(consistent direction, kernel-backed) and npl32 +0.1-0.3%, both inside the ~0.4% run-to-run band. The +warp-remap makes the grouped GEMM kernel ITSELF ~1.3% faster (occupancy DID rise) but the step is +BW/SSM-bound, so it does NOT lift MoE decode. No patch 0028.** MoE stays ~85% of vLLM @npl128. + +## The change that was built+measured (bit-exact, then REVERTED) +`ggml/src/ggml-cuda/mmq.cuh`. The FP4-MMA path couples the weight-row tile to the warp count via the +invariant `static_assert(nwarps*tile_C::I == mmq_y)` (mmq.cuh:3280; `tile_C::I==16` for the m16n8k64 +block-scaled FP4 MMA). `nwarps` is global `256/warp_size = 8`, pinning `mmq_y=128`; the 0017 +`GGML_CUDA_FP4_MMQ_Y` knob alone would TRIP that assert at 64. B-3 makes nwarps TYPE-AWARE: a new +`mmq_get_nwarps_device()` (+ 3-arg host overload) returns `mmq_y/16 = 4` for NVFP4-reduced (else +the stock 8), so `mmq_y=64 -> nwarps=4 -> 128 threads/CTA` (vs 256) -> ~2x resident CTAs. 2 overloads + +9 `` call-site swaps (kernel, process_tile, write_back_mma, stream_k_fixup, nvfp4 loader, 2 host). +Built with `-DGGML_CUDA_FP4_MMQ_Y=64`; the compile SUCCEEDS (the static_assert now holds at 4*16=64). +**Default `GGML_CUDA_FP4_MMQ_Y==128` returns stock nwarps for every type => a default build is +byte-identical to stock** (the bit-exact opt-out, proven by the md5 below at 128). + +### Bit-exactness is EMPIRICAL here (not by-construction) +The per-output K-reduction order is mmq_y-invariant (each output row owned by one thread), but mmq_y=64 +DOUBLES `nty` (row-tiles), changing the stream-k `kbc` partition => an output tile's K-range can be +split across CTAs at different points and recombined by `mul_mat_q_stream_k_fixup` in a different +grouping => FP non-associativity COULD perturb the last logit bits and flip a greedy argmax. It did NOT +for the gate prompt (md5 matched), but B-3 is therefore NOT bit-exact-by-construction - a default-ON +ship would be a (small) precision risk. This is a second reason not to ship it for a 0% gain. + +## GATE (bit-exact) - BOTH MODELS PASS +greedy `llama-completion -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1`: +- dense q36-27b-nvfp4 = 5951a5b4d624ce891e22ab5fca9bc439 == ref +- MoE q36-35b-a3b-nvfp4 = 07db32c2bcb78d17a43ed18bc22705cd == ref +- test-backend-ops CUDA0: **MUL_MAT 1146/1146 PASS, MUL_MAT_ID 806/806 PASS.** + +## MEASUREMENT 1 - end-to-end MoE decode_agg, clean BACK-TO-BACK A/B (build-cuda-base @128 vs build-cuda @64) +`llama-batched-bench -m q36-35b-a3b-nvfp4 -c 32768 -ngl 99 -fa on -npp 128 -ntg 128 -npl 32,128`, S_TG t/s, +3 reps alternating (no concurrent load): + +| npl | mmq_y=128 (base) mean | mmq_y=64 (B-3) mean | delta | +|----:|----------------------:|--------------------:|------:| +| 32 | 437.6 (437.3-437.7) | 438.8 (438.4-439.1) | +0.29% | +| 128 | 750.1 (748.9-751.1) | 753.1 (753.0-753.4) | +0.40% | + +Every B-3 rep edges the base by +0.3-0.4% @npl128 (consistent, kernel-backed), but the per-build spread +(base 748.9-751.1) OVERLAPS - it is at the edge of noise, NOT a meaningful lift. Caps the end-to-end win +at well under 1%, nowhere near the gap to vLLM (882). + +## MEASUREMENT 2 - nsys kernel-level A/B (the meaningful B-3 evidence), clean, no concurrent load +`GGML_CUDA_DISABLE_GRAPHS=1 nsys ... -npp 4 -ntg 32 -npl 128`, decode-isolated window, `cuda_gpu_kern_sum`: + +| kernel (% of window) | mmq_y=128 total ns | mmq_y=64 total ns | delta | +|---------------------------------|-------------------:|------------------:|-------:| +| gated_delta_net (SSM, ~40%) | 2,335,951,709 | 2,334,847,390 | 0.0% (untouched, DOMINANT) | +| **mul_mat_q** (MoE GEMM, ~26%) | **1,502,548,958** | **1,483,685,630** | **-1.26% (kernel faster)** | +| mul_mat_q (router, ~3.7%) | 224,532,704 | 210,885,920 | -6.1% | +| quantize_mmq_nvfp4 (act-quant, ~2%) | 119,118,624 | 118,718,496 | -0.3% | +| **mul_mat_q_stream_k_fixup<128>** (~0.6%) | **26,848,479** | **38,117,532** | **+42% (fixup COSTLIER)** | + +The warp-remap DOES what it claims at the kernel level: the grouped GEMM is **-1.3%** (more resident +CTAs hide a sliver of weight-load latency). But (a) it is only ~26% of the step, (b) halving mmq_y +DOUBLES the row-tiles so the stream-k fixup recombination grows **+42%** (+11.3M ns), eating ~60% of the +GEMM's 18.9M-ns saving, and (c) the step is dominated by the gated_delta_net SSM (~40%, untouched, and +already PAST vLLM's BW efficiency per 0018-0022) with the GEMM itself at the LPDDR5x BW floor. Net +mul_mat region saving ~7.6M ns on a ~5.8B-ns window = ~0.13%; end-to-end +0.3-0.4% (within noise). +**This is the definitive BW-bound proof: even a real occupancy win on the target kernel does not move +end-to-end** - the same outcome as patch 0015 (M-tile NEUTRAL), 0017 (MINBLOCKS +8.7% slower), and B-2 +(act-quant FLAT). The MoE grouped GEMM is bandwidth-limited, not occupancy-limited, at the kernel exit. + +## DECISION +No patch 0028 (B-3 does not lift end-to-end; bit-exactness is empirical, not by-construction; the fixup +penalty + BW floor swamp the +1.3% kernel win). Dev tree reverted to pristine 0025 (no ggml diff), +build-cuda reconfigured to default (no flag) and rebuilt. The `mmq_get_nwarps_device()` remap is a +correct, reusable warp-remap should occupancy-bound FP4 hardware ever appear; it is inert on GB10. + +--- + +# FINAL ASSESSMENT - the honest bit-exact MoE ceiling, and the recommendation + +## The bit-exact MoE GEMM/launch track is now EXHAUSTED +| MoE lever (bit-exact) | result | MoE decode_agg @npl128 | +|-----------------------|--------|------------------------| +| 0025 re-graph (B-1, LANDED) | the ONLY bit-exact MoE win | ~82% -> **~85%** of vLLM | +| B-2 act-quant retune (no patch) | FLAT (128 already optimal) | +0% | +| B-3 mmq_y-down warp-remap (no patch) | FLAT (kernel -1.3%, e2e +0.3% noise) | +0% | + +**Honest bit-exact MoE ceiling on GB10 = ~85% of vLLM @npl128 (753 / 882.2), ~87.5% @npl32 (439 / 500.8).** +B-1 (re-graph, in 0025) banked the move from ~82% to ~85%; B-2 and B-3 each add 0. The grouped-GEMM/ +launch track has no remaining bit-exact headroom. + +## Is the residual the structural Marlin-MoE gap? YES. +The remaining ~15% is structural and uncloseable bit-exactly, decomposed from the nsys: +- **Grouped FP4 GEMM (~26%) is at the LPDDR5x BW floor.** B-3 proved an occupancy win there is + end-to-end-inert. vLLM ships a purpose-built **Marlin-NvFp4** grouped GEMM (a different, more + bandwidth-efficient schedule); llama runs native FP4-MMA W4A4 (a HIGHER arithmetic tier, but the + decode shape is BW-bound so the tier does not help). This is THE structural gap and matches + FP4_GEMM_SCOPE_B.md's "MoE ceiling ~76% from the GEMM track alone." +- **The SSM recurrence (~40%) is already PAST vLLM** (84.6% vs 82.4% peak BW, 0018-0022) - not a lever. +- **bf16 projections (~10.5%)** - both engines pay similar; not a bit-exact lever. + +No bit-exact lever closes the structural grouped-GEMM gap. ~85% is the honest bit-exact MoE plateau. + +## RECOMMENDATION: ship the bit-exact ~85% as DEFAULT; expose 0026 bf16-SSM as a documented opt-in for the last ~10% on MoE (NOT default, NOT in the recommended config) + +Per the user's decision rule ("pursue B first; if it cannot reach/beat vLLM on MoE, fall back to the +held hybrid/bf16 opt-in"): **B (bit-exact) cannot reach vLLM on MoE (~85%), so the fallback applies - +but with a hard caveat the team must carry.** + +1. **DEFAULT = the bit-exact plateau (0025 with the re-graph), MoE ~85% of vLLM.** This is the honest, + precision-safe ship: the recurrence already BEATS vLLM's BW efficiency, the GEMM is the same FP4 + arithmetic class, and the output is byte-identical to the f32 reference. Do not claim MoE *parity* + bit-exactly - claim ~85% with a precision profile at-or-above vLLM. + +2. **FALLBACK (opt-in only) = 0026 hybrid bf16-SSM.** It is the ONLY remaining MoE lever (it speeds the + ~40% recurrence, the part B does not touch): measured **+11.5% MoE decode** (1110.7 -> 1238.1 t/s in + the 0026 harness) -> would lift MoE ~85% -> **~95% of vLLM**. BUT: (a) it is **non-bit-exact**; (b) it + **FAILS the MoE KL ship-gate by a wide margin** (MeanKLD ~0.045 / Same-top-p ~91% vs the 1e-3 / 99.5% + bar - the gated-DeltaNet state is hypersensitive to bf16; A_HYBRID_SSM_RESULTS.md: "MoE has NO low-KL + regime ... Do NOT put a hybrid T in the gallery/recommended config"); and (c) even then it reaches + **~95%, not a clean beat** of vLLM, while conceding precision vLLM keeps (all-f32 SSM state). + + => Ship 0026 default-OFF (`ssm_hybrid_tau_thresh = 0` / no `--ssm-bf16-tau`); expose the bf16-SSM as + an EXPLICIT opt-in flag for callers who knowingly accept a real MoE precision regression for ~+11.5% + decode (~95% of vLLM). Keep it OUT of the gallery/recommended MoE config. + +**Bottom line for the parent:** bit-exact MoE on GB10 plateaus at **~85% of vLLM** and the residual is +the structural Marlin-NvFp4 grouped-GEMM gap that NO bit-exact lever closes (B-1 banked the re-graph; +B-2 and B-3 are 0). Bit-exact does NOT reach/beat vLLM on MoE. The only lever that closes more (to ~95%) +is the held 0026 bf16-SSM, which is **non-bit-exact AND fails the MoE KL gate** - so it ships **opt-in, +default-off, not in the recommended config**, not as the default. Recommend shipping the honest ~85% +bit-exact default and documenting the opt-in for users who accept the precision tradeoff. Do not market +MoE parity; the bit-exact default is ~85% with a precision profile at-or-above vLLM, which is the +defensible claim. + +Assisted-by: Claude:opus-4.8 [Claude Code]