From 010067d900f1c3f9582198970913a157a800a8ae Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Tue, 23 Jun 2026 13:49:15 +0000
Subject: [PATCH] feat(paged): mirror patch 0014 - expert-aware MoE token-tile cap

Mirror of the dev-tree engine patch (ggml mmq.cuh) into the paged patch set,
plus its measurement writeup. Adds LLAMA_MOE_MMQ_X, an opt-in env cap on the MoE
grouped-GEMM token-tile (mmq_x) for the MUL_MAT_ID path; default-off =
byte-identical to stock.

Honest result of the MoE near-term lever: the npl128 decode cliff does NOT exist
on current HEAD (stock decode is monotonic 85/282/629/935/1295/1779 t/s at npl
1/8/32/64/128/256; the old cliff was fixed upstream by the sorted grouped FP4-MMA
GEMM + MoE stream-k). The cap is therefore not a cliff fix but a modest
high-batch decode micro-optimization: cap64 gives +4.8% decode at npl128 and
+2.3% at npl256 (reproducible, neutral at npl<=64) for a ~1.3% prefill cost;
cap16/cap32 are net-negative (prefill -41% / -17%). Full tables in
MOE_TOKEN_TILE_CAP.md; durable density-aware follow-up in
MOE_GROUPED_GEMM_SCOPE.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto
---
 ...aged-expert-aware-moe-token-tile-cap.patch | 140 ++++++++++++++++++
 .../patches/paged/MOE_TOKEN_TILE_CAP.md       |  99 +++++++++++++
 2 files changed, 239 insertions(+)
 create mode 100644 backend/cpp/llama-cpp/patches/paged/0014-paged-expert-aware-moe-token-tile-cap.patch
 create mode 100644 backend/cpp/llama-cpp/patches/paged/MOE_TOKEN_TILE_CAP.md

diff --git a/backend/cpp/llama-cpp/patches/paged/0014-paged-expert-aware-moe-token-tile-cap.patch b/backend/cpp/llama-cpp/patches/paged/0014-paged-expert-aware-moe-token-tile-cap.patch
new file mode 100644
index 000000000..fc9ff66b5
--- /dev/null
+++ b/backend/cpp/llama-cpp/patches/paged/0014-paged-expert-aware-moe-token-tile-cap.patch
@@ -0,0 +1,140 @@
+From 652b858252b354f4d4fb49e5ed7468eeee8e32fc Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto
+Date: Tue, 23 Jun 2026 15:47:06 +0200
+Subject: [PATCH] feat(paged): expert-aware MoE token-tile cap (patch 0014)
+
+On GB10 (sm_121) the Qwen3-30B-A3B-class mxfp4 MoE decode path already uses the
+sorted grouped FP4-MMA GEMM (MUL_MAT_ID -> ggml_cuda_mul_mat_q ids branch:
+mm_ids_helper moe_align/scatter + one persistent stream-k mul_mat_q), so the
+originally reported npl128 throughput cliff does NOT reproduce on this build.
+llama-batched-bench decode (S_TG t/s) is monotonic across batch:
+
+  npl    1    8   32   64   128   256
+  S_TG  85  282  629  935  1295  1779   (stock, mxfp4 MoE, -fa on)
+
+There is no knee to erase; the old cliff (a real high-batch regression, 620 t/s
+at npl128) was fixed upstream by grouped-mmq + MoE stream-k load balancing.
+
+What remains is a pure tile-shape micro-inefficiency. In mul_mat_q_case the
+token-tile width mmq_x is chosen to cover ncols_max (= ne12, the per-expert
+column upper bound = token count, up to 128) in one column-tile. At MoE decode
+the per-expert token density is ~ne12*k/n_experts (top-8 of 128 => ~1/16 of
+ne12, e.g. ~8 tokens/expert at npl128), so each expert's single mmq_x-wide
+col-tile is only ~6% filled: the MMA accumulator tile is mmq_x-wide at compile
+time and burns throughput on the padding columns while the larger y-tile lowers
+occupancy. Stock picks the LARGEST tile (128) where the SMALLEST tile that still
+covers the density would raise fill + occupancy at no extra weight read (at
+tokens/expert <= mmq_x there is exactly one non-empty col-tile per expert; the
+emptier tiles are skipped by the jt*mmq_x >= col_diff guard in the stream-k
+kernel) - the inverse of vLLM's small per-expert BLOCK_SIZE_M.
+
+Add LLAMA_MOE_MMQ_X: an env cap on mmq_x for the MUL_MAT_ID path only
+(expert_bounds != nullptr). Default (unset or <= 0) = disabled, so the mmq_x
+selection, and therefore every kernel launched, is byte-identical to stock. The
+cap only ever lowers the loop's upper bound and still selects from the same
+granularity- and shared-memory-validated mmq_x set stock already uses for
+smaller batches, so no new kernel configuration is exercised.
+
+Measured on GB10, qwen3coder-mxfp4.gguf, -fa on, -npp 128 -ntg 128, same binary,
+only LLAMA_MOE_MMQ_X differs (decode S_TG t/s / prefill S_PP t/s):
+
+  npl   stock S_TG   cap64 S_TG    d%    stock S_PP   cap64 S_PP
+   64          936          938  +0.1          2924         2883
+  128         1295         1357  +4.8          3075         3038
+  256         1784         1825  +2.3          3085         3046
+
+  (reproduced across interleaved reps; cap64 npl128 = 1357.5/1357.0, very stable)
+
+cap64 lifts high-batch decode +4.8% (npl128) / +2.3% (npl256), neutral at
+npl <= 64, for a consistent ~1.3% prefill cost. Smaller caps are net-negative:
+cap16 / cap32 crater prefill -41% / -17% (a 512-token prefill ubatch has ~32
+tokens/expert, which overflows a 16/32-wide tile into extra col-tiles + weight
+re-reads), so 64 is the recommended value and the only one that helps net.
+
+Honest framing: this is NOT a cliff fix (no cliff exists) and not a real-server
+throughput unlock (llama-server continuous batching already scales). It is a
+modest high-effective-batch DECODE micro-optimization that matches vLLM's
+smaller per-expert M-tiling, surfaced as an opt-in, default-off knob. The
+durable density-aware auto-select (drop the blunt global cap, choose mmq_x from
+ne_get_rows / n_active_experts so prefill keeps its large tile) is scoped in
+patches/paged/MOE_GROUPED_GEMM_SCOPE.md.
+
+Correctness: greedy temp-0 llama-server output with cap64 is byte-identical to
+stock for single-stream generation (fibonacci / capital-of-France / photosynthesis
+prompts) and stays coherent; batched-bench ran thousands of capped MoE matmuls at
+npl128/256 (mmq_x forced 128 -> 64) with no CUDA error / NaN and stable output.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto
+---
+ ggml/src/ggml-cuda/mmq.cuh | 37 ++++++++++++++++++++++++++++++++++++-
+ 1 file changed, 36 insertions(+), 1 deletion(-)
+
+diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh
+index edf546d..cff608e 100644
+--- a/ggml/src/ggml-cuda/mmq.cuh
++++ b/ggml/src/ggml-cuda/mmq.cuh
+@@ -6,6 +6,7 @@
+ 
+ #include
+ #include
++#include
+ 
+ using namespace ggml_cuda_mma;
+ 
+@@ -4052,6 +4053,18 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a
+     }
+ }
+ 
++// [paged patch 0014] MoE token-tile (mmq_x) cap, read once from env LLAMA_MOE_MMQ_X.
++// Returns 0 when unset / non-positive => disabled (stock mmq_x selection, byte-identical).
++// On the MUL_MAT_ID grouped-GEMM path this caps the per-expert column-tile width toward the
++// low MoE-decode per-expert token density, raising tile fill + occupancy (see mul_mat_q_case).
++static inline int ggml_cuda_moe_mmq_x_cap() {
++    static const int cap = []() -> int {
++        const char * s = getenv("LLAMA_MOE_MMQ_X");
++        return s ? atoi(s) : 0;
++    }();
++    return cap;
++}
++
+ template
+ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
+     const int id = ggml_cuda_get_device();
+@@ -4063,10 +4076,32 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
+     const int mmq_x_max = get_mmq_x_max_host(cc);
+     const int mmq_y = get_mmq_y_host(cc);
+ 
++    // [paged patch 0014] expert-aware MoE token-tile (mmq_x) cap.
++    // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are
++    // tokens sorted by expert; stock picks mmq_x to cover ncols_max (= ne12, the token count,
++    // up to 128) in a single column-tile. At MoE decode the per-expert token density is low
++    // (top-k of many experts: ~ne12*k/n_experts tokens/expert, e.g. ~8 at npl128 for
++    // Qwen3-30B-A3B top-8/128), so each expert's single mmq_x-wide col-tile is mostly empty:
++    // the MMA accumulator tile is mmq_x-wide at compile time and wastes throughput on the
++    // padding columns while the larger y-tile lowers occupancy. Capping mmq_x toward the
++    // per-expert density raises tile fill + occupancy with no extra weight reads (at
++    // tokens/expert <= mmq_x there is still exactly one non-empty col-tile per expert; the
++    // emptier tiles are skipped by the jt*mmq_x >= col_diff guard in the stream-k kernel).
++    // Default (env unset or <= 0) = disabled => mmq_x selection is byte-identical to stock;
++    // off the ids path the cap never applies.
++    int mmq_x_lim = mmq_x_max;
++    if (args.expert_bounds != nullptr) {
++        const int moe_cap = ggml_cuda_moe_mmq_x_cap();
++        if (moe_cap > 0) {
++            const int cap = moe_cap < 8 ? 8 : moe_cap;
++            mmq_x_lim = cap < mmq_x_max ? cap : mmq_x_max;
++        }
++    }
++
+     int mmq_x_best = 0;
+     int ntiles_x_best = INT_MAX;
+ 
+-    for (int mmq_x = 8; mmq_x <= mmq_x_max && ntiles_x_best > 1; mmq_x += 8) {
++    for (int mmq_x = 8; mmq_x <= mmq_x_lim && ntiles_x_best > 1; mmq_x += 8) {
+         const int granularity = mmq_get_granularity_host(mmq_x, cc);
+ 
+         if (mmq_x % granularity != 0 || mmq_get_nbytes_shared(mmq_x, mmq_y, cc, warp_size, nwarps) > smpbo) {
+-- 
+2.43.0
+
diff --git a/backend/cpp/llama-cpp/patches/paged/MOE_TOKEN_TILE_CAP.md b/backend/cpp/llama-cpp/patches/paged/MOE_TOKEN_TILE_CAP.md
new file mode 100644
index 000000000..88602291d
--- /dev/null
+++ b/backend/cpp/llama-cpp/patches/paged/MOE_TOKEN_TILE_CAP.md
@@ -0,0 +1,99 @@
+# Patch 0014 findings: expert-aware MoE token-tile cap (LLAMA_MOE_MMQ_X)
+
+Near-term lever for the MoE-vs-vLLM workflow on GB10 (sm_121). Companion to
+`0014-paged-expert-aware-moe-token-tile-cap.patch`. Model:
+Qwen3-Coder-30B-A3B, 128 experts, top-8, mxfp4 experts
+(`~/bench/qwen3coder-mxfp4.gguf`). Dev tree `~/llama-paged-dev` (branch `paged`),
+`build-cuda` sm_121.
+
+## Headline (honest): there is no npl128 cliff to erase on this build
+
+The mission premise was a 25% decode drop at npl128 (batched-bench 253/505/830/620
+@ npl 8/32/64/128). It does **not** reproduce. Stock decode is monotonic:
+
+```
+llama-batched-bench, qwen3coder-mxfp4.gguf, -fa on, -npp 128 -ntg 128, S_TG t/s
+  npl      1     8    32    64   128   256
+  stock   85   282   629   935  1295  1779   <- monotonic, no knee
+```
+
+The old cliff was a real high-batch regression since fixed upstream: mxfp4 MoE
+decode on GB10 already takes the sorted grouped FP4-MMA GEMM (MUL_MAT_ID ->
+`ggml_cuda_mul_mat_q` ids branch: `mm_ids_helper` moe_align/scatter + one
+persistent stream-k `mul_mat_q`), i.e. vLLM's algorithm. See
+`MOE_GROUPED_GEMM_SCOPE.md`.
+
+## What the knob does
+
+`mul_mat_q_case` picks the token-tile width `mmq_x` to cover `ncols_max`
+(= `ne12`, the per-expert column upper bound = token count, up to 128) in one
+column-tile. At MoE decode the per-expert density is `~ne12*k/n_experts`
+(top-8/128 => ~1/16 of `ne12`), so each expert's `mmq_x`-wide col-tile is only
+~6% filled: the MMA accumulator tile is `mmq_x`-wide at compile time and wastes
+throughput on the padding columns, and the larger y-tile lowers occupancy.
+
+`LLAMA_MOE_MMQ_X=` caps `mmq_x` on the MUL_MAT_ID path only
+(`expert_bounds != nullptr`). It only lowers the selection-loop upper bound and
+still chooses from the same granularity/shared-memory-validated `mmq_x` set stock
+already uses for smaller batches - no new kernel configuration. Default
+(unset/<=0) = disabled => byte-identical to stock.
+
+## Measurements (same binary, only LLAMA_MOE_MMQ_X differs)
+
+Decode throughput, S_TG t/s:
+
+```
+  npl   stock   cap16   cap32   cap64
+    1      85      85      85      85
+    8     282     280     282     282
+   32     629     623     629     628
+   64     935     915     949     934
+  128    1295    1204    1344    1357   <- cap64 +4.8% (cap16 -7%)
+  256    1779    1370    1723    1820   <- cap64 +2.3% (cap16 -23%)
+```
+
+Prefill throughput, S_PP t/s (the cost):
+
+```
+  npl   stock   cap16   cap32   cap64
+  128    3083    1817    2559    3038
+  256    3084    1818    2560    3046
+                 -41%    -17%   -1.3%
+```
+
+Reproducibility (interleaved off/cap64, two reps each):
+
+```
+  npl   off rep1/rep2   cap64 rep1/rep2
+  128   1300 / 1290     1357.5 / 1357.0
+  256   1786 / 1782     1826.3 / 1824.5
+```
+
+cap64 is stable to <0.1% and the gain sits well above the ~1% run-to-run band.
+
+## Why 64 is the only value that helps net
+
+A 512-token prefill ubatch routes ~32 tokens/expert. cap16/cap32 force those into
+16/32-wide tiles, overflowing into extra col-tiles + weight re-reads -> prefill
+craters (-41% / -17%). cap64 still holds the prefill density in one tile (32 < 64)
+so prefill is near-neutral (-1.3%), while decode (~8 tokens/expert at npl128) gets
+the fuller, higher-occupancy tile.
+
+## Verdict
+
+- Real but **modest** high-effective-batch DECODE micro-optimization
+  (+4.8% npl128, +2.3% npl256), neutral at npl<=64, ~1.3% prefill cost at cap64.
+- **Not** a cliff fix (no cliff) and **not** a real-server unlock (llama-server
+  continuous batching already scales). Shipped as an opt-in, default-off knob;
+  recommended value 64 for decode-heavy high-concurrency deployments.
+- Correctness: greedy temp-0 server output with cap64 is byte-identical to stock
+  for single-stream generation and stays coherent; thousands of capped MoE
+  matmuls at npl128/256 ran with no CUDA error / NaN.
+
+## Durable follow-up (scoped, not implemented)
+
+Replace the blunt global cap with a density-aware auto-select: choose `mmq_x`
+from `ne_get_rows / n_active_experts` inside `mul_mat_q_case` so decode gets the
+small tile while prefill keeps its large tile automatically (removes the ~1.3%
+prefill cost). Plus the block-padded `moe_align` in `mm_ids_helper`. See
+`MOE_GROUPED_GEMM_SCOPE.md`.
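
The density arithmetic used throughout the writeup (~8 tokens/expert at npl128, ~32 tokens/expert for a 512-token prefill ubatch, ~6% tile fill) can be reproduced with a small host-side sketch. It is an illustration only, not the scoped follow-up: every function name here is hypothetical, routing is assumed uniform (real routing is skewed), `headroom` is an invented tuning parameter, and the granularity and shared-memory checks that `mul_mat_q_case` applies to each `mmq_x` candidate are left out.

```cpp
// Sketch under stated assumptions: uniform routing, hypothetical helper names,
// no granularity or shared-memory validation of the chosen tile width.

// Mean tokens routed to each expert when every token picks top_k of n_experts
// (rounded up, so a partially used slot still counts as a token).
constexpr int mean_tokens_per_expert(int n_tokens, int top_k, int n_experts) {
    return (n_tokens * top_k + n_experts - 1) / n_experts;
}

// Number of mmq_x-wide column tiles needed to hold one expert's tokens.
constexpr int col_tiles_per_expert(int tokens_per_expert, int mmq_x) {
    return (tokens_per_expert + mmq_x - 1) / mmq_x;
}

// Percentage of accumulator columns that carry real tokens, taken over all
// column tiles the expert occupies.
constexpr double tile_fill_percent(int tokens_per_expert, int mmq_x) {
    return 100.0 * tokens_per_expert /
           (col_tiles_per_expert(tokens_per_expert, mmq_x) * mmq_x);
}

// Hypothetical density-aware choice: the smallest multiple of 8 that covers
// headroom * mean density, clamped to [8, mmq_x_max]. `headroom` stands in for
// routing skew and would have to be tuned against measurements.
constexpr int density_aware_mmq_x(int n_tokens, int top_k, int n_experts,
                                  int headroom, int mmq_x_max) {
    const int want    = headroom * mean_tokens_per_expert(n_tokens, top_k, n_experts);
    const int rounded = ((want + 7) / 8) * 8;
    return rounded < 8 ? 8 : (rounded > mmq_x_max ? mmq_x_max : rounded);
}
```

With top-8 of 128 experts, `mean_tokens_per_expert(128, 8, 128)` gives 8, and `tile_fill_percent(8, 128)` gives 6.25 against 12.5 for a 64-wide tile. For a 512-token ubatch the mean is 32, which needs two 16-wide tiles but only one 64-wide tile. The mean fits a 32-wide tile exactly, so under this uniform model any expert above the mean spills into a second tile, consistent with the overflow the writeup describes for cap32. The decode table shows cap16 losing throughput at npl128 even though the mean density is only 8, so a mean-based rule like `density_aware_mmq_x` would need tuning against benchmarks before it could replace the global cap.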