LocalAI/backend/cpp/llama-cpp/patches/paged/B_MOE_PROGRESS.md
Ettore Di Giacinto 1f857f179e docs(paged): B-2 down_proj act-quant retune RESULT - negative (no headroom)
B-2 / M1 (SPEEDUP_HUNT rank #2): bit-exact block/grid/occupancy retune of
quantize_mmq_nvfp4 (the MoE down_proj activation-quant, ~2% of the MoE decode
step). Built+measured on a clean 0025 base (DGX GB10 sm_121), then reverted -
it does not lift.

Finding: the existing blockDim.x=128 is ALREADY the kernel-level optimum for
quantize_mmq_nvfp4 on GB10. nsys (8193 invocations): block=128 total 117.4M ns
is the fastest; 64 +8.7%, 192 +9.9%, 256 +6.9%. End-to-end MoE decode_agg is
flat within 0.4% noise across all block sizes {32..256} (npl32 ~438, npl128
~751 t/s). The act-quant is ~2% of a BW-bound step, so even a perfect kernel
caps the win at ~2%, and 128 is already optimal => measured 0%. Same outcome as
patch 0015 (M-tile) and 0017 (MINBLOCKS): no occupancy headroom on this
256-tiny-expert BW-bound model.

Bit-exactness proven: md5 identical at block 64/128/256 for both models (the
per-thread quant body is untouched; thread->output map is invariant to
blockDim.x). Gate at default: dense 5951a5b4 == ref, MoE 07db32c2 == ref,
MUL_MAT 1146/1146, MUL_MAT_ID 806/806 PASS.

MoE stays ~85% of vLLM @npl128 / ~87% @npl32 - still well below vLLM, so the
remaining MoE lever is B-3 (mmq_y-down warp-remap on the grouped FP4 GEMM).
No patch 0027; dev tree reverted to pristine 0025. Full data in B_MOE_RESULTS.md.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
2026-06-26 18:31:51 +00:00

B_MOE_PROGRESS.md - B-2 (down_proj act-quant retune, patch 0027) checkpoint

Agent: B2-build (GPU agent). Base: 0025 tip (DGX ~/llama-paged-dev 2f4f5ab, branch b-work), independent of the held hybrid 0026. Worktree: .../feat+paged-attention.

The lever (B-2 / M1)

Bit-exact block/grid/occupancy retune of quantize_mmq_nvfp4 (the MoE down_proj activation-quant, ~2% of the MoE decode step). ggml/src/ggml-cuda/quantize.cu, quantize_mmq_fp4_cuda NVFP4 branch.

Why it is provably byte-identical

quantize_mmq_nvfp4 maps each thread to a column purely through the global linear index: gy = blockDim.x*blockIdx.y + threadIdx.x, then i0_base = gy*QK_NVFP4_SUB. There is NO cross-thread communication (no shared memory, no warp reduction), and every thread owns a disjoint output sub-block (its own sub slot in block_fp4_mmq). The thread -> output-byte map, and therefore the bytes produced, is invariant to blockDim.x as long as block_num_y is recomputed from the SAME blockDim.x. We retune ONLY blockDim.x; the per-thread quant body and writeback are untouched.
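The invariance argument can be checked with a small host-side simulation. This is a sketch under stated assumptions, not the kernel: it models a single row, `QK_NVFP4_SUB = 16` is assumed rather than read from the ggml headers, and `quantize_sub_block` is a hypothetical stand-in for the untouched per-thread quant body. Only the index arithmetic (`gy`, `i0_base`, `block_num_y`) follows the description above.

```python
QK_NVFP4_SUB = 16  # ASSUMPTION: sub-block width; the real constant lives in ggml


def ceil_div(a, b):
    return (a + b - 1) // b


def make_row(n):
    """Deterministic test data so every run sees the same input."""
    return [((i * 37) % 101 - 50) / 7.0 for i in range(n)]


def quantize_sub_block(values):
    """Hypothetical stand-in for the per-thread quant body (NOT real NVFP4 math)."""
    amax = max(abs(v) for v in values)
    scale = amax / 6.0 if amax > 0 else 1.0
    return tuple(round(v / scale) for v in values)


def simulate_launch(row, block_dim_x):
    """Replay the grid for one row: every (blockIdx.y, threadIdx.x) pair runs once."""
    n_sub = ceil_div(len(row), QK_NVFP4_SUB)
    block_num_y = ceil_div(n_sub, block_dim_x)  # recomputed from the SAME block size
    out = {}
    for block_idx_y in range(block_num_y):
        for thread_idx_x in range(block_dim_x):
            gy = block_dim_x * block_idx_y + thread_idx_x
            i0_base = gy * QK_NVFP4_SUB
            if i0_base >= len(row):
                continue  # out-of-range threads in the last block do nothing
            assert gy not in out  # each output slot has exactly one owner
            out[gy] = quantize_sub_block(row[i0_base:i0_base + QK_NVFP4_SUB])
    return [out[k] for k in sorted(out)]


row = make_row(2048)
reference = simulate_launch(row, 128)
for block in (32, 64, 96, 160, 192, 256):
    print(block, simulate_launch(row, block) == reference)
```

Because each output slot is keyed by `gy` alone, changing `block_dim_x` only changes which (block, thread) pair produces a slot, never what is written to it.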

Change

A static const int nvfp4_block_size is selected once from the env var LLAMA_MOE_QUANT_BLOCK (default 128 = baseline; final = measured GB10 winner), and block_num_y is recomputed from it consistently. ~20 LOC, one TU.
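The selection logic can be sketched as follows. This is an illustration in Python rather than the reverted C++ change: the function names are hypothetical, and the validity rule (a positive multiple of 32, at most 1024) is an assumption based on CUDA's warp size and threads-per-block limit, not something the patch is known to enforce.

```python
import os

DEFAULT_BLOCK = 128  # baseline block size stated in this note


def select_block_size(env=None, default=DEFAULT_BLOCK):
    """Read LLAMA_MOE_QUANT_BLOCK, falling back to the default on bad input."""
    env = os.environ if env is None else env
    raw = env.get("LLAMA_MOE_QUANT_BLOCK")
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        return default
    # ASSUMPTION: accept only whole warps within the 1024-thread block limit.
    if value <= 0 or value > 1024 or value % 32 != 0:
        return default
    return value


def block_num_y(n_sub_blocks, block_size):
    """Grid size recomputed from the same block size, rounding up."""
    return (n_sub_blocks + block_size - 1) // block_size


block = select_block_size()
print(block, block_num_y(128, block))
```

In the C++ version a `static const` initializer evaluates the environment once per process; the Python sketch re-reads it on every call, which is fine for illustration.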

Status: COMPLETE - NEGATIVE (no lift). Full result in B_MOE_RESULTS.md.

  • Branched b-work off 0025 (2f4f5ab); patch applied to quantize.cu.
  • Build clean (llama-completion, llama-batched-bench, test-backend-ops). BUILD_EXIT=0.
  • md5 gate @block=128 (default): dense 5951a5b4 == ref, MoE 07db32c2 == ref. MUL_MAT 1146/1146, MUL_MAT_ID 806/806 PASS.
  • BIT-EXACT proof across block sizes: block 64 AND 256 -> identical md5 both models.
  • Sweep block {32,64,96,128,160,192,256}: end-to-end FLAT (npl32 436-438 t/s, npl128 749-752 t/s, all within 0.4% noise). NO block size lifts decode.
  • nsys quantize_mmq_nvfp4: block=128 is the FASTEST (117.4M ns; 64 +8.7%, 192 +9.9%, 256 +6.9%). 128 already optimal => ZERO headroom.
  • DECISION: no patch 0027 (does not lift). Dev tree reverted to pristine 0025. Recommend B-3.
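The flat end-to-end sweep is what a simple time-fraction estimate predicts. A minimal sketch, assuming the act-quant kernel is exactly 2% of the decode step (the note says ~2%) and that nothing else in the step changes with block size; the kernel ratios are the nsys deltas listed above:

```python
ACT_QUANT_FRACTION = 0.02  # ASSUMPTION: "~2%" taken as exactly 2%
NOISE = 0.004              # 0.4% end-to-end noise band from the sweep

# Kernel time relative to block=128, from the nsys numbers in this note.
KERNEL_TIME_RATIO = {64: 1.087, 128: 1.000, 192: 1.099, 256: 1.069}


def step_time_ratio(fraction, kernel_time_ratio):
    """Whole-step time relative to baseline when only the kernel time changes."""
    return (1.0 - fraction) + fraction * kernel_time_ratio


# Upper bound: a kernel that costs nothing at all.
best_speedup = 1.0 / step_time_ratio(ACT_QUANT_FRACTION, 0.0)
print(f"ceiling with a free kernel: {best_speedup - 1.0:.2%} throughput gain")

for block, ratio in sorted(KERNEL_TIME_RATIO.items()):
    delta = step_time_ratio(ACT_QUANT_FRACTION, ratio) - 1.0
    verdict = "inside noise" if abs(delta) < NOISE else "outside noise"
    print(f"block {block}: step time {delta:+.2%} ({verdict})")
```

Under these assumptions even the slowest alternative moves the step by roughly 0.2%, which is below the 0.4% noise band, so the kernel-level differences seen in nsys cannot show up in decode_agg.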

Gate references

  • dense q36-27b-nvfp4 md5 == 5951a5b4d624ce891e22ab5fca9bc439
  • MoE q36-35b-a3b-nvfp4 md5 == 07db32c2bcb78d17a43ed18bc22705cd
  • gate cmd: llama-completion -m M -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1
  • bench: llama-batched-bench -m M -c 32768 -ngl 99 -fa on -npp 128 -ntg 128 -npl 32,128 (S_TG=decode_agg)
  • vLLM ref decode_agg @npl128 = 882.2 t/s (npl32 ref 500.8).
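The gap to vLLM follows directly from the references above and the sweep ranges in the Status list. A small sketch of that arithmetic, using only numbers stated in this note (the dictionary names are illustrative):

```python
# vLLM reference decode_agg in t/s, keyed by npl.
VLLM_REF_TPS = {32: 500.8, 128: 882.2}

# Measured decode_agg range (t/s) across the block-size sweep, keyed by npl.
MEASURED_TPS = {32: (436.0, 438.0), 128: (749.0, 752.0)}


def fraction_of_vllm(npl):
    """Return (low, high) measured throughput as a fraction of the vLLM reference."""
    lo, hi = MEASURED_TPS[npl]
    ref = VLLM_REF_TPS[npl]
    return lo / ref, hi / ref


for npl in sorted(VLLM_REF_TPS):
    lo, hi = fraction_of_vllm(npl)
    print(f"npl{npl}: {lo:.1%} to {hi:.1%} of the vLLM reference")
```

Both ranges are far wider than the ~2% ceiling of the act-quant retune, which is why the recommendation moves on to B-3.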

Assisted-by: Claude:opus-4.8 [Claude Code]