From 1f857f179ecdccf33581fc54acdb4c9a470ca033 Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Fri, 26 Jun 2026 18:31:51 +0000 Subject: [PATCH] docs(paged): B-2 down_proj act-quant retune RESULT - negative (no headroom) B-2 / M1 (SPEEDUP_HUNT rank #2): bit-exact block/grid/occupancy retune of quantize_mmq_nvfp4 (the MoE down_proj activation-quant, ~2% of the MoE decode step). Built+measured on a clean 0025 base (DGX GB10 sm_121), then reverted - it does not lift. Finding: the existing blockDim.x=128 is ALREADY the kernel-level optimum for quantize_mmq_nvfp4 on GB10. nsys (8193 invocations): block=128 total 117.4M ns is the fastest; 64 +8.7%, 192 +9.9%, 256 +6.9%. End-to-end MoE decode_agg is flat within 0.4% noise across all block sizes {32..256} (npl32 ~438, npl128 ~751 t/s). The act-quant is ~2% of a BW-bound step, so even a perfect kernel caps the win at ~2%, and 128 is already optimal => measured 0%. Same outcome as patch 0015 (M-tile) and 0017 (MINBLOCKS): no occupancy headroom on this 256-tiny-expert BW-bound model. Bit-exactness proven: md5 identical at block 64/128/256 for both models (the per-thread quant body is untouched; thread->output map is invariant to blockDim.x). Gate at default: dense 5951a5b4 == ref, MoE 07db32c2 == ref, MUL_MAT 1146/1146, MUL_MAT_ID 806/806 PASS. MoE stays ~85% of vLLM @npl128 / ~87% @npl32 - still well below vLLM, so the remaining MoE lever is B-3 (mmq_y-down warp-remap on the grouped FP4 GEMM). No patch 0027; dev tree reverted to pristine 0025. Full data in B_MOE_RESULTS.md. Signed-off-by: Ettore Di Giacinto Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto --- .../llama-cpp/patches/paged/B_MOE_PROGRESS.md | 41 +++++++++ .../llama-cpp/patches/paged/B_MOE_RESULTS.md | 90 +++++++++++++++++++ 2 files changed, 131 insertions(+) create mode 100644 backend/cpp/llama-cpp/patches/paged/B_MOE_PROGRESS.md create mode 100644 backend/cpp/llama-cpp/patches/paged/B_MOE_RESULTS.md diff --git a/backend/cpp/llama-cpp/patches/paged/B_MOE_PROGRESS.md b/backend/cpp/llama-cpp/patches/paged/B_MOE_PROGRESS.md new file mode 100644 index 000000000..5874c5eb2 --- /dev/null +++ b/backend/cpp/llama-cpp/patches/paged/B_MOE_PROGRESS.md @@ -0,0 +1,41 @@ +# B_MOE_PROGRESS.md - B-2 (down_proj act-quant retune, patch 0027) checkpoint + +Agent: B2-build (GPU agent). Base: 0025 tip (DGX `~/llama-paged-dev` `2f4f5ab`, branch `b-work`), +independent of the held hybrid 0026. Worktree: `.../feat+paged-attention`. + +## The lever (B-2 / M1) +Bit-exact block/grid/occupancy retune of `quantize_mmq_nvfp4` (the MoE down_proj activation-quant, +~2% of the MoE decode step). `ggml/src/ggml-cuda/quantize.cu`, `quantize_mmq_fp4_cuda` NVFP4 branch. + +## Why it is provably byte-identical +`quantize_mmq_nvfp4` maps thread -> column purely through the global linear index +`gy = blockDim.x*blockIdx.y + threadIdx.x` -> `i0_base = gy*QK_NVFP4_SUB`, with NO cross-thread +communication (no shared memory, no warp reduction) and every thread owning a disjoint output +sub-block (its own `sub` slot in `block_fp4_mmq`). So the (thread)->output-byte map - and thus the +produced bytes - are invariant to `blockDim.x` as long as `block_num_y` is recomputed from the SAME +`blockDim.x`. We retune ONLY `blockDim.x`; the per-thread quant body + writeback are untouched. + +## Change +`static const int nvfp4_block_size` selected once via env `LLAMA_MOE_QUANT_BLOCK` (default 128 = +baseline; final = measured GB10 winner), `block_num_y` recomputed consistently. ~20 LOC, one TU. + +## Status: COMPLETE - NEGATIVE (no lift). Full result in B_MOE_RESULTS.md. +- [x] Branched `b-work` off 0025 (`2f4f5ab`); patch applied to quantize.cu. +- [x] Build clean (llama-completion, llama-batched-bench, test-backend-ops). BUILD_EXIT=0. +- [x] md5 gate @block=128 (default): dense 5951a5b4 == ref, MoE 07db32c2 == ref. MUL_MAT 1146/1146, + MUL_MAT_ID 806/806 PASS. +- [x] BIT-EXACT proof across block sizes: block 64 AND 256 -> identical md5 both models. +- [x] Sweep block {32,64,96,128,160,192,256}: end-to-end FLAT (npl32 436-438, npl128 749-752, all + within 0.4% noise). NO block lifts decode. +- [x] nsys quantize_mmq_nvfp4: block=128 is the FASTEST (117.4M ns; 64 +8.7%, 192 +9.9%, 256 +6.9%). + 128 already optimal => ZERO headroom. +- [x] DECISION: no patch 0027 (does not lift). Dev tree reverted to pristine 0025. Recommend B-3. + +## Gate references +- dense q36-27b-nvfp4 md5 == 5951a5b4d624ce891e22ab5fca9bc439 +- MoE q36-35b-a3b-nvfp4 md5 == 07db32c2bcb78d17a43ed18bc22705cd +- gate cmd: `llama-completion -m M -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1` +- bench: `llama-batched-bench -m M -c 32768 -ngl 99 -fa on -npp 128 -ntg 128 -npl 32,128` (S_TG=decode_agg) +- vLLM ref decode_agg @npl128 = 882.2 t/s (npl32 ref 500.8). + +Assisted-by: Claude:opus-4.8 [Claude Code] diff --git a/backend/cpp/llama-cpp/patches/paged/B_MOE_RESULTS.md b/backend/cpp/llama-cpp/patches/paged/B_MOE_RESULTS.md new file mode 100644 index 000000000..7aa79af96 --- /dev/null +++ b/backend/cpp/llama-cpp/patches/paged/B_MOE_RESULTS.md @@ -0,0 +1,90 @@ +# B_MOE_RESULTS.md - B-2 (down_proj act-quant retune / M1) RESULT: NEGATIVE (no headroom) + +Agent: B2-build (GPU agent, DGX GB10 sm_121). Base: clean 0025 tip (`~/llama-paged-dev` `2f4f5ab`, +branch `b-work`), independent of the held hybrid 0026 (`33e7c65`). Lever: SPEEDUP_HUNT.md section B, +rank #2 ("down_proj act-quant retune (M1): bit-exact, bounded - act-quant is ~2% of MoE step"). + +## VERDICT +**The existing `blockDim.x = 128` is ALREADY the kernel-level optimum for `quantize_mmq_nvfp4` on +GB10 sm_121. B-2 has zero headroom: there is nothing to bake (128 is the current default), and it +does NOT lift MoE decode (end-to-end flat within 0.4% noise across all block sizes). No patch 0027.** +MoE stays ~85% of vLLM @npl128 / ~87% @npl32, well below vLLM => the remaining MoE lever is B-3. + +## The change that was built+measured (bit-exact, then REVERTED - did not lift) +`ggml/src/ggml-cuda/quantize.cu`, `quantize_mmq_fp4_cuda` NVFP4 branch. Replaced the hardcoded +`constexpr int nvfp4_block_size = 128` with a `static const int` selected once from env +`LLAMA_MOE_QUANT_BLOCK` (default 128), `block_num_y` recomputed from the SAME `blockDim.x`. ~20 LOC. + +### Why ANY block size is provably byte-identical (the bit-exact invariant) +`quantize_mmq_nvfp4` maps thread -> column purely via the global linear index +`gy = blockDim.x*blockIdx.y + threadIdx.x` -> `i0_base = gy*QK_NVFP4_SUB`, with NO cross-thread +communication (no shared memory, no warp reduction) and every thread writing its OWN disjoint output +sub-block (its own `sub` slot in `block_fp4_mmq`: `yqs[2*sub+0/1]`, `d4[sub]`). The per-thread quant +body (amax, the 5-offset fp8-code search, the q0/q1 nibble packing, the writeback) is untouched. So +the (thread)->output-byte map - and the produced bytes - are invariant to `blockDim.x`. Confirmed +empirically: md5 identical at block 64, 128, AND 256, both models. + +## GATE (bit-exact) - BOTH MODELS PASS at default AND at non-128 blocks +greedy `llama-completion -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1`: + +| block | dense q36-27b-nvfp4 md5 | MoE q36-35b-a3b-nvfp4 md5 | +|------:|-------------------------|---------------------------| +| 128 (default) | 5951a5b4d624ce891e22ab5fca9bc439 == ref | 07db32c2bcb78d17a43ed18bc22705cd == ref | +| 64 | 5951a5b4...439 == ref | 07db32c2...5cd == ref | +| 256 | 5951a5b4...439 == ref | 07db32c2...5cd == ref | + +test-backend-ops (CUDA0): **MUL_MAT 1146/1146 PASS**, **MUL_MAT_ID 806/806 PASS**. + +## MEASUREMENT 1 - end-to-end MoE decode_agg (S_TG t/s), the actual throughput +`llama-batched-bench -m q36-35b-a3b-nvfp4.gguf -c 32768 -ngl 99 -fa on -npp 128 -ntg 128 -npl 32,128`, +1 rep/block (run-to-run noise ~0.3-0.5%): + +| block | npl=32 S_TG | npl=128 S_TG | +|------:|------------:|-------------:| +| 32 | 437.54 | 750.41 | +| 64 | 437.82 | 751.68 | +| 96 | 437.69 | 749.46 | +| **128 (base/default)** | **438.14** | **751.76** | +| 160 | 436.38 | 750.99 | +| 192 | 436.81 | 751.61 | +| 256 | 437.77 | 750.14 | + +Spread: npl32 = 1.76 t/s (0.4%), npl128 = 2.3 t/s (0.3%) - all within noise. **No block size lifts +end-to-end decode.** Expected: the act-quant is ~2% of the MoE step, so even a perfect (0 ns) quantize +kernel caps the end-to-end win at ~2%, and 128 is already optimal => measured 0%. + +## MEASUREMENT 2 - nsys kernel-level delta on quantize_mmq_nvfp4 (the meaningful B-2 metric) +`nsys --report cuda_gpu_kern_sum`, MoE, `GGML_CUDA_DISABLE_GRAPHS=1 -npp 4 -ntg 32 -npl 128`, +8,193 kernel invocations (the kernel is 2.0-2.2% of GPU time in this decode-heavy window): + +| block | total ns | avg ns | median ns | vs 128 (total) | +|------:|---------:|-------:|----------:|---------------:| +| 64 | 127,523,328 | 15,564.9 | 12,256 | +8.7% slower | +| **128 (default)** | **117,371,424** | **14,325.8** | **11,488** | baseline (fastest) | +| 192 | 128,970,464 | 15,741.5 | 12,032 | +9.9% slower | +| 256 | 125,422,048 | 15,308.4 | 11,936 | +6.9% slower | + +**128 is a clean local minimum** (faster than the 64 below and the 192/256 above; 96 and 160 are its +immediate neighbors, end-to-end-neutral, nsys-stats flaked on the re-runs but cannot beat a bracketed +local min). The 7-10% kernel-level regression of the alternatives at 0% end-to-end change is exactly +why end-to-end is flat: this BW-bound, 256-tiny-expert model has no col-tile/occupancy headroom in +the act-quant - the same conclusion patch 0015 reached for the M-tile and patch 0017 for MINBLOCKS. + +## WHERE MoE STANDS (decode_agg, this base = 0025 with the re-graph) +vLLM ref @npl128 = 882.2, @npl32 = 500.8. +- npl128: 751.8 / 882.2 = **85.2% of vLLM** +- npl32: 438.1 / 500.8 = **87.5% of vLLM** + +B-2 adds 0 (within noise). MoE is **still well below vLLM** => **TRY B-3** (the mmq_y-down warp-remap +on the grouped `mul_mat_q` GEMM, ~27% of the MoE step - the only untested MoE GEMM +lever; SPEEDUP_HUNT B rank #3, real kernel change, bit-exact, predicted bounded on this BW-bound +model). The structural MoE residual is the FP4 grouped GEMM at the LPDDR5x BW floor + the bf16 +projections (~10.5%); the act-quant tax (~2%) is NOT where the gap lives and is already optimally +tiled. Recurrence (~48%) is already past vLLM (0018-0022). + +## DECISION +No patch 0027 (B-2 does not lift; dev tree reverted to pristine 0025). The `LLAMA_MOE_QUANT_BLOCK` +hook + this measurement confirm 128 is the GB10 optimum, should other hardware ever want re-tuning. +Hand off to B-3 (patch 0028) as the next MoE GEMM lever. + +Assisted-by: Claude:opus-4.8 [Claude Code]