B-2 / M1 (SPEEDUP_HUNT rank #2): bit-exact block/grid/occupancy retune of quantize_mmq_nvfp4 (the MoE down_proj activation-quant, ~2% of the MoE decode step).

Built+measured on a clean 0025 base (DGX GB10 sm_121), then reverted - it does not lift.

Finding: the existing blockDim.x=128 is ALREADY the kernel-level optimum for quantize_mmq_nvfp4 on GB10. nsys (8193 invocations): block=128 total 117.4M ns is the fastest; 64 +8.7%, 192 +9.9%, 256 +6.9%. End-to-end MoE decode_agg is flat within 0.4% noise across all block sizes {32..256} (npl32 ~438, npl128 ~751 t/s). The act-quant is ~2% of a BW-bound step, so even a perfect kernel caps the win at ~2%, and 128 is already optimal => measured 0%. Same outcome as patch 0015 (M-tile) and 0017 (MINBLOCKS): no occupancy headroom on this 256-tiny-expert BW-bound model.

Bit-exactness proven: md5 identical at block 64/128/256 for both models (the per-thread quant body is untouched; thread->output map is invariant to blockDim.x). Gate at default: dense 5951a5b4 == ref, MoE 07db32c2 == ref, MUL_MAT 1146/1146, MUL_MAT_ID 806/806 PASS.

MoE stays ~85% of vLLM @npl128 / ~87% @npl32 - still well below vLLM, so the remaining MoE lever is B-3 (mmq_y-down warp-remap on the grouped FP4 GEMM). No patch 0027; dev tree reverted to pristine 0025. Full data in B_MOE_RESULTS.md.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
B_MOE_PROGRESS.md - B-2 (down_proj act-quant retune, patch 0027) checkpoint
Agent: B2-build (GPU agent). Base: 0025 tip (DGX ~/llama-paged-dev 2f4f5ab, branch b-work),
independent of the held hybrid 0026. Worktree: .../feat+paged-attention.
The lever (B-2 / M1)
Bit-exact block/grid/occupancy retune of quantize_mmq_nvfp4 (the MoE down_proj activation-quant,
~2% of the MoE decode step). ggml/src/ggml-cuda/quantize.cu, quantize_mmq_fp4_cuda NVFP4 branch.
Why it is provably byte-identical
quantize_mmq_nvfp4 maps thread -> column purely through the global linear index
gy = blockDim.x*blockIdx.y + threadIdx.x -> i0_base = gy*QK_NVFP4_SUB, with NO cross-thread
communication (no shared memory, no warp reduction) and every thread owning a disjoint output
sub-block (its own sub slot in block_fp4_mmq). So the (thread)->output-byte map - and thus the
produced bytes - are invariant to blockDim.x as long as block_num_y is recomputed from the SAME
blockDim.x. We retune ONLY blockDim.x; the per-thread quant body + writeback are untouched.
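The invariance argument above can be checked on the host with a small emulation of the launch geometry. This is a minimal sketch, not the kernel itself: kSubBlock is a hypothetical stand-in for QK_NVFP4_SUB, and the ceil-div grid size and the out-of-range guard are assumptions about how the launcher and kernel handle the tail.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for QK_NVFP4_SUB (assumed value, for illustration only).
constexpr int64_t kSubBlock = 16;

// Emulates the (blockIdx.y, threadIdx.x) -> i0_base mapping on the host.
// Returns, for every output sub-block slot, the i0_base written by its owner thread.
std::vector<int64_t> sub_block_offsets(int block_dim_x, int64_t n_sub_blocks) {
    assert(block_dim_x > 0 && n_sub_blocks >= 0);
    // Assumption: the grid is the ceil-div of the work by the SAME blockDim.x.
    const int64_t block_num_y = (n_sub_blocks + block_dim_x - 1) / block_dim_x;
    std::vector<int64_t> i0_of_slot(n_sub_blocks, -1);
    for (int64_t by = 0; by < block_num_y; ++by) {
        for (int tx = 0; tx < block_dim_x; ++tx) {
            const int64_t gy = static_cast<int64_t>(block_dim_x) * by + tx;
            if (gy >= n_sub_blocks) {
                continue;  // assumed tail guard: surplus threads do nothing
            }
            assert(i0_of_slot[gy] == -1);  // every slot has exactly one owner
            i0_of_slot[gy] = gy * kSubBlock;
        }
    }
    return i0_of_slot;
}
```

Calling sub_block_offsets with 64, 128 and 256 for the same n_sub_blocks yields identical vectors, which mirrors the md5 result: changing blockDim.x only regroups the same global indices into different blocks.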
Change
static const int nvfp4_block_size selected once via env LLAMA_MOE_QUANT_BLOCK (default 128 =
baseline; final = measured GB10 winner), block_num_y recomputed consistently. ~20 LOC, one TU.
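A rough sketch of what such a one-time env-selected block size could look like. Only the LLAMA_MOE_QUANT_BLOCK name and the default of 128 come from the note; parse_block_size, the validation rules (multiple of 32, range 32..1024) and the ceil-div form of block_num_y are assumptions for illustration and may differ from the reverted patch.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: parse an override, falling back when unset or invalid.
int parse_block_size(const char * s, int fallback) {
    if (s == nullptr || *s == '\0') {
        return fallback;
    }
    char * end = nullptr;
    const long v = std::strtol(s, &end, 10);
    // Assumed policy: accept only whole warps (multiples of 32) up to 1024 threads.
    if (*end != '\0' || v < 32 || v > 1024 || v % 32 != 0) {
        return fallback;
    }
    return static_cast<int>(v);
}

// Read the environment once; the function-local static is initialized on first call.
int nvfp4_block_size() {
    static const int value = parse_block_size(std::getenv("LLAMA_MOE_QUANT_BLOCK"), 128);
    return value;
}

// The grid dimension must be derived from the same block size used for the launch.
int64_t block_num_y(int64_t n_sub_blocks, int block_size) {
    assert(block_size > 0 && n_sub_blocks >= 0);
    return (n_sub_blocks + block_size - 1) / block_size;
}
```

Deriving block_num_y from the value returned by nvfp4_block_size(), rather than from a separate constant, is what keeps the thread-to-output map consistent across the sweep.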
Status: COMPLETE - NEGATIVE (no lift). Full result in B_MOE_RESULTS.md.
- Branched b-work off 0025 (2f4f5ab); patch applied to quantize.cu.
- Build clean (llama-completion, llama-batched-bench, test-backend-ops). BUILD_EXIT=0.
- md5 gate @block=128 (default): dense 5951a5b4 == ref, MoE 07db32c2 == ref. MUL_MAT 1146/1146, MUL_MAT_ID 806/806 PASS.
- BIT-EXACT proof across block sizes: block 64 AND 256 -> identical md5 both models.
- Sweep block {32,64,96,128,160,192,256}: end-to-end FLAT (npl32 436-438, npl128 749-752, all within 0.4% noise). NO block lifts decode.
- nsys quantize_mmq_nvfp4: block=128 is the FASTEST (117.4M ns; 64 +8.7%, 192 +9.9%, 256 +6.9%). 128 already optimal => ZERO headroom.
- DECISION: no patch 0027 (does not lift). Dev tree reverted to pristine 0025. Recommend B-3.
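Why the kernel-level differences vanish end-to-end can be sketched with Amdahl's law, taking the ~2% act-quant share quoted above as the accelerated fraction. The function names are illustrative, and the 2% figure is the note's own estimate, not a value derived here.

```cpp
#include <cassert>
#include <cmath>

// Amdahl's law: overall speedup when a fraction f of the step runs s times faster.
double amdahl_speedup(double f, double s) {
    assert(f >= 0.0 && f <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}

// Upper bound as s -> infinity, i.e. the kernel cost drops to zero.
double amdahl_limit(double f) {
    assert(f >= 0.0 && f < 1.0);
    return 1.0 / (1.0 - f);
}
```

With f = 0.02, amdahl_limit gives about 1.02x, so even a free kernel caps the win near 2%. A kernel that is 8.7% slower (s = 1/1.087) gives about 0.998x, a change of roughly 0.2% that sits inside the 0.4% run-to-run noise, consistent with the flat sweep.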
Gate references
- dense q36-27b-nvfp4 md5 == 5951a5b4d624ce891e22ab5fca9bc439
- MoE q36-35b-a3b-nvfp4 md5 == 07db32c2bcb78d17a43ed18bc22705cd
- gate cmd: llama-completion -m M -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1
- bench: llama-batched-bench -m M -c 32768 -ngl 99 -fa on -npp 128 -ntg 128 -npl 32,128 (S_TG=decode_agg)
- vLLM ref decode_agg @npl128 = 882.2 t/s (npl32 ref 500.8).
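The "~85% / ~87% of vLLM" figures follow directly from these references and the measured decode_agg values (~751 t/s @npl128, ~438 t/s @npl32). A trivial helper, named here only for illustration, makes the arithmetic explicit:

```cpp
#include <cassert>

// Measured throughput expressed as a percentage of a reference throughput.
double percent_of_ref(double measured_tps, double ref_tps) {
    assert(ref_tps > 0.0);
    return 100.0 * measured_tps / ref_tps;
}
```

percent_of_ref(751.0, 882.2) is about 85.1 and percent_of_ref(438.0, 500.8) is about 87.5, matching the rounded figures in the summary.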
Assisted-by: Claude:opus-4.8 [Claude Code]