docs(paged): B-2 down_proj act-quant retune RESULT - negative (no headroom)

B-2 / M1 (SPEEDUP_HUNT rank #2): bit-exact block/grid/occupancy retune of
quantize_mmq_nvfp4 (the MoE down_proj activation-quant, ~2% of the MoE decode
step). Built+measured on a clean 0025 base (DGX GB10 sm_121), then reverted -
it does not lift.

Finding: the existing blockDim.x=128 is ALREADY the kernel-level optimum for
quantize_mmq_nvfp4 on GB10. nsys (8193 invocations): block=128 total 117.4M ns
is the fastest; 64 +8.7%, 192 +9.9%, 256 +6.9%. End-to-end MoE decode_agg is
flat within 0.4% noise across all block sizes {32..256} (npl32 ~438, npl128
~751 t/s). The act-quant is ~2% of a BW-bound step, so even a perfect kernel
caps the win at ~2%, and 128 is already optimal => measured 0%. Same outcome as
patch 0015 (M-tile) and 0017 (MINBLOCKS): no occupancy headroom on this
256-tiny-expert BW-bound model.

Bit-exactness proven: md5 identical at block 64/128/256 for both models (the
per-thread quant body is untouched; thread->output map is invariant to
blockDim.x). Gate at default: dense 5951a5b4 == ref, MoE 07db32c2 == ref,
MUL_MAT 1146/1146, MUL_MAT_ID 806/806 PASS.

MoE stays ~85% of vLLM @npl128 / ~87% @npl32 - still well below vLLM, so the
remaining MoE lever is B-3 (mmq_y-down warp-remap on the grouped FP4 GEMM).
No patch 0027; dev tree reverted to pristine 0025. Full data in B_MOE_RESULTS.md.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
Ettore Di Giacinto
2026-06-26 18:31:51 +00:00
parent 33dfe7fd41
commit 1f857f179e
2 changed files with 131 additions and 0 deletions

View File

@@ -0,0 +1,41 @@
# B_MOE_PROGRESS.md - B-2 (down_proj act-quant retune, patch 0027) checkpoint
Agent: B2-build (GPU agent). Base: 0025 tip (DGX `~/llama-paged-dev` `2f4f5ab`, branch `b-work`),
independent of the held hybrid 0026. Worktree: `.../feat+paged-attention`.
## The lever (B-2 / M1)
Bit-exact block/grid/occupancy retune of `quantize_mmq_nvfp4` (the MoE down_proj activation-quant,
~2% of the MoE decode step). `ggml/src/ggml-cuda/quantize.cu`, `quantize_mmq_fp4_cuda` NVFP4 branch.
## Why it is provably byte-identical
`quantize_mmq_nvfp4` maps thread -> column purely through the global linear index
`gy = blockDim.x*blockIdx.y + threadIdx.x` -> `i0_base = gy*QK_NVFP4_SUB`, with NO cross-thread
communication (no shared memory, no warp reduction) and every thread owning a disjoint output
sub-block (its own `sub` slot in `block_fp4_mmq`). So the (thread)->output-byte map - and thus the
produced bytes - are invariant to `blockDim.x` as long as `block_num_y` is recomputed from the SAME
`blockDim.x`. We retune ONLY `blockDim.x`; the per-thread quant body + writeback are untouched.
## Change
`static const int nvfp4_block_size` selected once via env `LLAMA_MOE_QUANT_BLOCK` (default 128 =
baseline; final = measured GB10 winner), `block_num_y` recomputed consistently. ~20 LOC, one TU.
## Status: COMPLETE - NEGATIVE (no lift). Full result in B_MOE_RESULTS.md.
- [x] Branched `b-work` off 0025 (`2f4f5ab`); patch applied to quantize.cu.
- [x] Build clean (llama-completion, llama-batched-bench, test-backend-ops). BUILD_EXIT=0.
- [x] md5 gate @block=128 (default): dense 5951a5b4 == ref, MoE 07db32c2 == ref. MUL_MAT 1146/1146,
MUL_MAT_ID 806/806 PASS.
- [x] BIT-EXACT proof across block sizes: block 64 AND 256 -> identical md5 both models.
- [x] Sweep block {32,64,96,128,160,192,256}: end-to-end FLAT (npl32 436-438, npl128 749-752, all
within 0.4% noise). NO block lifts decode.
- [x] nsys quantize_mmq_nvfp4: block=128 is the FASTEST (117.4M ns; 64 +8.7%, 192 +9.9%, 256 +6.9%).
128 already optimal => ZERO headroom.
- [x] DECISION: no patch 0027 (does not lift). Dev tree reverted to pristine 0025. Recommend B-3.
## Gate references
- dense q36-27b-nvfp4 md5 == 5951a5b4d624ce891e22ab5fca9bc439
- MoE q36-35b-a3b-nvfp4 md5 == 07db32c2bcb78d17a43ed18bc22705cd
- gate cmd: `llama-completion -m M -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1`
- bench: `llama-batched-bench -m M -c 32768 -ngl 99 -fa on -npp 128 -ntg 128 -npl 32,128` (S_TG=decode_agg)
- vLLM ref decode_agg @npl128 = 882.2 t/s (npl32 ref 500.8).
Assisted-by: Claude:opus-4.8 [Claude Code]

View File

@@ -0,0 +1,90 @@
# B_MOE_RESULTS.md - B-2 (down_proj act-quant retune / M1) RESULT: NEGATIVE (no headroom)
Agent: B2-build (GPU agent, DGX GB10 sm_121). Base: clean 0025 tip (`~/llama-paged-dev` `2f4f5ab`,
branch `b-work`), independent of the held hybrid 0026 (`33e7c65`). Lever: SPEEDUP_HUNT.md section B,
rank #2 ("down_proj act-quant retune (M1): bit-exact, bounded - act-quant is ~2% of MoE step").
## VERDICT
**The existing `blockDim.x = 128` is ALREADY the kernel-level optimum for `quantize_mmq_nvfp4` on
GB10 sm_121. B-2 has zero headroom: there is nothing to bake (128 is the current default), and it
does NOT lift MoE decode (end-to-end flat within 0.4% noise across all block sizes). No patch 0027.**
MoE stays ~85% of vLLM @npl128 / ~87% @npl32, well below vLLM => the remaining MoE lever is B-3.
## The change that was built+measured (bit-exact, then REVERTED - did not lift)
`ggml/src/ggml-cuda/quantize.cu`, `quantize_mmq_fp4_cuda` NVFP4 branch. Replaced the hardcoded
`constexpr int nvfp4_block_size = 128` with a `static const int` selected once from env
`LLAMA_MOE_QUANT_BLOCK` (default 128), `block_num_y` recomputed from the SAME `blockDim.x`. ~20 LOC.
### Why ANY block size is provably byte-identical (the bit-exact invariant)
`quantize_mmq_nvfp4` maps thread -> column purely via the global linear index
`gy = blockDim.x*blockIdx.y + threadIdx.x` -> `i0_base = gy*QK_NVFP4_SUB`, with NO cross-thread
communication (no shared memory, no warp reduction) and every thread writing its OWN disjoint output
sub-block (its own `sub` slot in `block_fp4_mmq`: `yqs[2*sub+0/1]`, `d4[sub]`). The per-thread quant
body (amax, the 5-offset fp8-code search, the q0/q1 nibble packing, the writeback) is untouched. So
the (thread)->output-byte map - and the produced bytes - are invariant to `blockDim.x`. Confirmed
empirically: md5 identical at block 64, 128, AND 256, both models.
## GATE (bit-exact) - BOTH MODELS PASS at default AND at non-128 blocks
greedy `llama-completion -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1`:
| block | dense q36-27b-nvfp4 md5 | MoE q36-35b-a3b-nvfp4 md5 |
|------:|-------------------------|---------------------------|
| 128 (default) | 5951a5b4d624ce891e22ab5fca9bc439 == ref | 07db32c2bcb78d17a43ed18bc22705cd == ref |
| 64 | 5951a5b4...439 == ref | 07db32c2...5cd == ref |
| 256 | 5951a5b4...439 == ref | 07db32c2...5cd == ref |
test-backend-ops (CUDA0): **MUL_MAT 1146/1146 PASS**, **MUL_MAT_ID 806/806 PASS**.
## MEASUREMENT 1 - end-to-end MoE decode_agg (S_TG t/s), the actual throughput
`llama-batched-bench -m q36-35b-a3b-nvfp4.gguf -c 32768 -ngl 99 -fa on -npp 128 -ntg 128 -npl 32,128`,
1 rep/block (run-to-run noise ~0.3-0.5%):
| block | npl=32 S_TG | npl=128 S_TG |
|------:|------------:|-------------:|
| 32 | 437.54 | 750.41 |
| 64 | 437.82 | 751.68 |
| 96 | 437.69 | 749.46 |
| **128 (base/default)** | **438.14** | **751.76** |
| 160 | 436.38 | 750.99 |
| 192 | 436.81 | 751.61 |
| 256 | 437.77 | 750.14 |
Spread: npl32 = 1.76 t/s (0.4%), npl128 = 2.3 t/s (0.3%) - all within noise. **No block size lifts
end-to-end decode.** Expected: the act-quant is ~2% of the MoE step, so even a perfect (0 ns) quantize
kernel caps the end-to-end win at ~2%, and 128 is already optimal => measured 0%.
## MEASUREMENT 2 - nsys kernel-level delta on quantize_mmq_nvfp4 (the meaningful B-2 metric)
`nsys --report cuda_gpu_kern_sum`, MoE, `GGML_CUDA_DISABLE_GRAPHS=1 -npp 4 -ntg 32 -npl 128`,
8,193 kernel invocations (the kernel is 2.0-2.2% of GPU time in this decode-heavy window):
| block | total ns | avg ns | median ns | vs 128 (total) |
|------:|---------:|-------:|----------:|---------------:|
| 64 | 127,523,328 | 15,564.9 | 12,256 | +8.7% slower |
| **128 (default)** | **117,371,424** | **14,325.8** | **11,488** | baseline (fastest) |
| 192 | 128,970,464 | 15,741.5 | 12,032 | +9.9% slower |
| 256 | 125,422,048 | 15,308.4 | 11,936 | +6.9% slower |
**128 is a clean local minimum** (faster than the 64 below and the 192/256 above; 96 and 160 are its
immediate neighbors, end-to-end-neutral, nsys-stats flaked on the re-runs but cannot beat a bracketed
local min). The 7-10% kernel-level regression of the alternatives at 0% end-to-end change is exactly
why end-to-end is flat: this BW-bound, 256-tiny-expert model has no col-tile/occupancy headroom in
the act-quant - the same conclusion patch 0015 reached for the M-tile and patch 0017 for MINBLOCKS.
## WHERE MoE STANDS (decode_agg, this base = 0025 with the re-graph)
vLLM ref @npl128 = 882.2, @npl32 = 500.8.
- npl128: 751.8 / 882.2 = **85.2% of vLLM**
- npl32: 438.1 / 500.8 = **87.5% of vLLM**
B-2 adds 0 (within noise). MoE is **still well below vLLM** => **TRY B-3** (the mmq_y-down warp-remap
on the grouped `mul_mat_q<NVFP4,M-tile=64>` GEMM, ~27% of the MoE step - the only untested MoE GEMM
lever; SPEEDUP_HUNT B rank #3, real kernel change, bit-exact, predicted bounded on this BW-bound
model). The structural MoE residual is the FP4 grouped GEMM at the LPDDR5x BW floor + the bf16
projections (~10.5%); the act-quant tax (~2%) is NOT where the gap lives and is already optimally
tiled. Recurrence (~48%) is already past vLLM (0018-0022).
## DECISION
No patch 0027 (B-2 does not lift; dev tree reverted to pristine 0025). The `LLAMA_MOE_QUANT_BLOCK`
hook + this measurement confirm 128 is the GB10 optimum, should other hardware ever want re-tuning.
Hand off to B-3 (patch 0028) as the next MoE GEMM lever.
Assisted-by: Claude:opus-4.8 [Claude Code]