feat(paged): mirror patch 0014 - expert-aware MoE token-tile cap

Mirror of the dev-tree engine patch (ggml mmq.cuh) into the paged patch set,
plus its measurement writeup. Adds LLAMA_MOE_MMQ_X, an opt-in env cap on the MoE
grouped-GEMM token-tile (mmq_x) for the MUL_MAT_ID path; default-off =
byte-identical to stock.

Honest result of the MoE near-term lever: the npl128 decode cliff does NOT exist
on current HEAD (stock decode is monotonic 85/282/629/935/1295/1779 t/s at npl
1/8/32/64/128/256; the old cliff was fixed upstream by the sorted grouped
FP4-MMA GEMM + MoE stream-k). The cap is therefore not a cliff fix but a modest
high-batch decode micro-optimization: cap64 gives +4.8% decode at npl128 and
+2.3% at npl256 (reproducible, neutral at npl<=64) for a ~1.3% prefill cost;
cap16/cap32 are net-negative (prefill -41% / -17%). Full tables in
MOE_TOKEN_TILE_CAP.md; durable density-aware follow-up in
MOE_GROUPED_GEMM_SCOPE.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ettore Di Giacinto
2026-06-23 13:49:15 +00:00
parent 8925c009b7
commit 010067d900
2 changed files with 239 additions and 0 deletions


@@ -0,0 +1,140 @@
From 652b858252b354f4d4fb49e5ed7468eeee8e32fc Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Tue, 23 Jun 2026 15:47:06 +0200
Subject: [PATCH] feat(paged): expert-aware MoE token-tile cap (patch 0014)
On GB10 (sm_121) the Qwen3-30B-A3B-class mxfp4 MoE decode path already uses the
sorted grouped FP4-MMA GEMM (MUL_MAT_ID -> ggml_cuda_mul_mat_q ids branch:
mm_ids_helper moe_align/scatter + one persistent stream-k mul_mat_q), so the
originally reported npl128 throughput cliff does NOT reproduce on this build.
llama-batched-bench decode (S_TG t/s) is monotonic across batch:
  npl      1    8   32   64   128   256
  S_TG    85  282  629  935  1295  1779   (stock, mxfp4 MoE, -fa on)
There is no knee to erase; the old cliff (a real high-batch regression, 620 t/s
at npl128) was fixed upstream by grouped-mmq + MoE stream-k load balancing.
What remains is a pure tile-shape micro-inefficiency. In mul_mat_q_case the
token-tile width mmq_x is chosen to cover ncols_max (= ne12, the per-expert
column upper bound = token count, up to 128) in one column-tile. At MoE decode
the per-expert token density is ~ne12*k/n_experts (top-8 of 128 => ~1/16 of
ne12, e.g. ~8 tokens/expert at npl128), so each expert's single mmq_x-wide
col-tile is only ~6% filled: the MMA accumulator tile is mmq_x-wide at compile
time and burns throughput on the padding columns while the larger y-tile lowers
occupancy. Stock picks the LARGEST tile (128) where the SMALLEST tile that still
covers the density would raise fill + occupancy at no extra weight read (at
tokens/expert <= mmq_x there is exactly one non-empty col-tile per expert; the
emptier tiles are skipped by the jt*mmq_x >= col_diff guard in the stream-k
kernel) - the inverse of vLLM's small per-expert BLOCK_SIZE_M.
Add LLAMA_MOE_MMQ_X: an env cap on mmq_x for the MUL_MAT_ID path only
(expert_bounds != nullptr). Default (unset or <= 0) = disabled, so the mmq_x
selection, and therefore every kernel launched, is byte-identical to stock. The
cap only ever lowers the loop's upper bound and still selects from the same
granularity- and shared-memory-validated mmq_x set stock already uses for
smaller batches, so no new kernel configuration is exercised.
Measured on GB10, qwen3coder-mxfp4.gguf, -fa on, -npp 128 -ntg 128, same binary,
only LLAMA_MOE_MMQ_X differs (decode S_TG t/s / prefill S_PP t/s):
  npl  stock S_TG  cap64 S_TG    d%  stock S_PP  cap64 S_PP
   64         936         938  +0.1        2924        2883
  128        1295        1357  +4.8        3075        3038
  256        1784        1825  +2.3        3085        3046
(reproduced across interleaved reps; cap64 npl128 = 1357.5/1357.0, very stable)
cap64 lifts high-batch decode +4.8% (npl128) / +2.3% (npl256), neutral at
npl <= 64, for a consistent ~1.3% prefill cost. Smaller caps are net-negative:
cap16 / cap32 crater prefill -41% / -17% (a 512-token prefill ubatch has ~32
tokens/expert, which overflows a 16/32-wide tile into extra col-tiles + weight
re-reads), so 64 is the recommended value and the only one that helps net.
Honest framing: this is NOT a cliff fix (no cliff exists) and not a real-server
throughput unlock (llama-server continuous batching already scales). It is a
modest high-effective-batch DECODE micro-optimization that matches vLLM's
smaller per-expert M-tiling, surfaced as an opt-in, default-off knob. The
durable density-aware auto-select (drop the blunt global cap, choose mmq_x from
ne_get_rows / n_active_experts so prefill keeps its large tile) is scoped in
patches/paged/MOE_GROUPED_GEMM_SCOPE.md.
Correctness: greedy temp-0 llama-server output with cap64 is byte-identical to
stock for single-stream generation (fibonacci / capital-of-France / photosynthesis
prompts) and stays coherent; batched-bench ran thousands of capped MoE matmuls at
npl128/256 (mmq_x forced 128 -> 64) with no CUDA error / NaN and stable output.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
ggml/src/ggml-cuda/mmq.cuh | 37 ++++++++++++++++++++++++++++++++++++-
1 file changed, 36 insertions(+), 1 deletion(-)
diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh
index edf546d..cff608e 100644
--- a/ggml/src/ggml-cuda/mmq.cuh
+++ b/ggml/src/ggml-cuda/mmq.cuh
@@ -6,6 +6,7 @@
#include <climits>
#include <cstdint>
+#include <cstdlib>
using namespace ggml_cuda_mma;
@@ -4052,6 +4053,18 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a
}
}
+// [paged patch 0014] MoE token-tile (mmq_x) cap, read once from env LLAMA_MOE_MMQ_X.
+// Returns 0 when unset / non-positive => disabled (stock mmq_x selection, byte-identical).
+// On the MUL_MAT_ID grouped-GEMM path this caps the per-expert column-tile width toward the
+// low MoE-decode per-expert token density, raising tile fill + occupancy (see mul_mat_q_case).
+static inline int ggml_cuda_moe_mmq_x_cap() {
+    static const int cap = []() -> int {
+        const char * s = getenv("LLAMA_MOE_MMQ_X");
+        return s ? atoi(s) : 0;
+    }();
+    return cap;
+}
+
template <ggml_type type>
void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
const int id = ggml_cuda_get_device();
@@ -4063,10 +4076,32 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
const int mmq_x_max = get_mmq_x_max_host(cc);
const int mmq_y = get_mmq_y_host(cc);
+    // [paged patch 0014] expert-aware MoE token-tile (mmq_x) cap.
+    // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are
+    // tokens sorted by expert; stock picks mmq_x to cover ncols_max (= ne12, the token count,
+    // up to 128) in a single column-tile. At MoE decode the per-expert token density is low
+    // (top-k of many experts: ~ne12*k/n_experts tokens/expert, e.g. ~8 at npl128 for
+    // Qwen3-30B-A3B top-8/128), so each expert's single mmq_x-wide col-tile is mostly empty:
+    // the MMA accumulator tile is mmq_x-wide at compile time and wastes throughput on the
+    // padding columns while the larger y-tile lowers occupancy. Capping mmq_x toward the
+    // per-expert density raises tile fill + occupancy with no extra weight reads (at
+    // tokens/expert <= mmq_x there is still exactly one non-empty col-tile per expert; the
+    // emptier tiles are skipped by the jt*mmq_x >= col_diff guard in the stream-k kernel).
+    // Default (env unset or <= 0) = disabled => mmq_x selection is byte-identical to stock;
+    // off the ids path the cap never applies.
+    int mmq_x_lim = mmq_x_max;
+    if (args.expert_bounds != nullptr) {
+        const int moe_cap = ggml_cuda_moe_mmq_x_cap();
+        if (moe_cap > 0) {
+            const int cap = moe_cap < 8 ? 8 : moe_cap;
+            mmq_x_lim = cap < mmq_x_max ? cap : mmq_x_max;
+        }
+    }
+
int mmq_x_best = 0;
int ntiles_x_best = INT_MAX;
- for (int mmq_x = 8; mmq_x <= mmq_x_max && ntiles_x_best > 1; mmq_x += 8) {
+ for (int mmq_x = 8; mmq_x <= mmq_x_lim && ntiles_x_best > 1; mmq_x += 8) {
const int granularity = mmq_get_granularity_host(mmq_x, cc);
if (mmq_x % granularity != 0 || mmq_get_nbytes_shared<type>(mmq_x, mmq_y, cc, warp_size, nwarps) > smpbo) {
--
2.43.0


@@ -0,0 +1,99 @@
# Patch 0014 findings: expert-aware MoE token-tile cap (LLAMA_MOE_MMQ_X)
Near-term lever for the MoE-vs-vLLM workflow on GB10 (sm_121). Companion to
`0014-paged-expert-aware-moe-token-tile-cap.patch`. Model:
Qwen3-Coder-30B-A3B, 128 experts, top-8, mxfp4 experts
(`~/bench/qwen3coder-mxfp4.gguf`). Dev tree `~/llama-paged-dev` (branch `paged`),
`build-cuda` sm_121.
## Headline (honest): there is no npl128 cliff to erase on this build
The premise for this work was a 25% decode drop from npl64 to npl128
(llama-batched-bench S_TG 253/505/830/620 t/s at npl 8/32/64/128). It does
**not** reproduce on this build. Stock decode is monotonic:
```
llama-batched-bench, qwen3coder-mxfp4.gguf, -fa on, -npp 128 -ntg 128, S_TG t/s
npl     1    8   32   64   128   256
stock  85  282  629  935  1295  1779   <- monotonic, no knee
```
The old cliff was a real high-batch regression since fixed upstream: mxfp4 MoE
decode on GB10 already takes the sorted grouped FP4-MMA GEMM (MUL_MAT_ID ->
`ggml_cuda_mul_mat_q` ids branch: `mm_ids_helper` moe_align/scatter + one
persistent stream-k `mul_mat_q`), i.e. vLLM's algorithm. See
`MOE_GROUPED_GEMM_SCOPE.md`.
## What the knob does
`mul_mat_q_case` picks the token-tile width `mmq_x` to cover `ncols_max`
(= `ne12`, the per-expert column upper bound = token count, up to 128) in one
column-tile. At MoE decode the per-expert density is `~ne12*k/n_experts`
(top-8/128 => ~1/16 of `ne12`, i.e. ~8 tokens/expert at npl128), so each expert's
128-wide col-tile is only ~6% filled (8/128): the MMA accumulator tile is
`mmq_x`-wide at compile time and wastes throughput on the padding columns, and
the larger y-tile lowers occupancy.
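
The density and fill figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not code from the patch: `tokens_per_expert` and `tile_fill` are hypothetical helper names, and routing is assumed to be perfectly uniform, which real routers are not.

```cpp
// Hypothetical helpers for illustration only; they do not exist in ggml.

// Average number of tokens routed to each expert, assuming uniform routing.
constexpr double tokens_per_expert(int n_tokens, int top_k, int n_experts) {
    return static_cast<double>(n_tokens) * top_k / n_experts;
}

// Fraction of an mmq_x-wide column-tile that holds real tokens (capped at 1).
constexpr double tile_fill(double tokens, int mmq_x) {
    return tokens >= mmq_x ? 1.0 : tokens / mmq_x;
}

// npl128 decode on a top-8 / 128-expert model: 8 tokens/expert on average.
static_assert(tokens_per_expert(128, 8, 128) == 8.0, "npl128 density");
// A 128-wide tile is 6.25% full; a 64-wide tile doubles that to 12.5%.
static_assert(tile_fill(8.0, 128) == 0.0625, "stock tile fill");
static_assert(tile_fill(8.0, 64) == 0.125, "cap64 tile fill");
```

Under this idealization, halving the tile width doubles the fill without adding a second col-tile, since 8 tokens still fit in one 64-wide tile.
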
`LLAMA_MOE_MMQ_X=<n>` caps `mmq_x` on the MUL_MAT_ID path only
(`expert_bounds != nullptr`). It only lowers the selection-loop upper bound and
still chooses from the same granularity/shared-memory-validated `mmq_x` set stock
already uses for smaller batches - no new kernel configuration. Default
(unset/<=0) = disabled => byte-identical to stock.
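
A simplified host-side sketch of how the cap feeds the selection loop may help. The clamp mirrors the logic added by the patch. The loop body (choosing the candidate with the fewest column-tiles) is an assumption about the surrounding stock code that the diff does not show, and the granularity and shared-memory checks are deliberately omitted. `effective_mmq_x_limit` and `pick_mmq_x` are hypothetical names.

```cpp
#include <climits>

// Mirrors the patch's clamp: the cap applies only on the ids (MUL_MAT_ID) path,
// is ignored when <= 0, is raised to at least 8, and never exceeds mmq_x_max.
constexpr int effective_mmq_x_limit(int mmq_x_max, int moe_cap, bool ids_path) {
    if (!ids_path || moe_cap <= 0) {
        return mmq_x_max;
    }
    const int cap = moe_cap < 8 ? 8 : moe_cap;
    return cap < mmq_x_max ? cap : mmq_x_max;
}

// Simplified selection loop (assumed shape): step mmq_x in multiples of 8 and
// keep the first candidate that yields the fewest column-tiles. The real loop
// also skips candidates that fail granularity or shared-memory limits.
constexpr int pick_mmq_x(int ncols_max, int mmq_x_lim) {
    int best        = 0;
    int ntiles_best = INT_MAX;
    for (int mmq_x = 8; mmq_x <= mmq_x_lim && ntiles_best > 1; mmq_x += 8) {
        const int ntiles = (ncols_max + mmq_x - 1) / mmq_x;
        if (ntiles < ntiles_best) {
            best        = mmq_x;
            ntiles_best = ntiles;
        }
    }
    return best;
}

// npl128: stock covers all 128 columns in one tile, cap64 selects 64 instead.
static_assert(pick_mmq_x(128, effective_mmq_x_limit(128, 0, true)) == 128, "stock");
static_assert(pick_mmq_x(128, effective_mmq_x_limit(128, 64, true)) == 64, "cap64");
```

In this sketch a batch of 64 columns or fewer already selects `mmq_x <= 64`, so cap64 changes nothing there, which is consistent with the knob being neutral at npl<=64.
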
## Measurements (same binary, only LLAMA_MOE_MMQ_X differs)
Decode throughput, S_TG t/s:
```
npl   stock  cap16  cap32  cap64
  1      85     85     85     85
  8     282    280    282    282
 32     629    623    629    628
 64     935    915    949    934
128    1295   1204   1344   1357   <- cap64 +4.8% (cap16 -7%)
256    1779   1370   1723   1820   <- cap64 +2.3% (cap16 -23%)
```
Prefill throughput, S_PP t/s (the cost):
```
npl   stock  cap16  cap32  cap64
128    3083   1817   2559   3038
256    3084   1818   2560   3046
               -41%   -17%  -1.3%
```
Reproducibility (interleaved off/cap64, two reps each):
```
npl   off rep1/rep2   cap64 rep1/rep2
128   1300 / 1290     1357.5 / 1357.0
256   1786 / 1782     1826.3 / 1824.5
```
cap64 is stable to <0.1% and the gain sits well above the ~1% run-to-run band.
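
The headline percentages can be cross-checked against the interleaved reps. The sketch below averages the two reps per configuration and computes the relative gain. `mean2` and `pct_gain` are hypothetical helper names used only for this check.

```cpp
// Hypothetical helpers for illustration only.
constexpr double mean2(double a, double b) { return (a + b) / 2.0; }

// Relative change of `value` over `base`, in percent.
constexpr double pct_gain(double base, double value) {
    return (value / base - 1.0) * 100.0;
}

// npl128: mean 1295 (off) vs 1357.25 (cap64).
constexpr double gain_npl128 = pct_gain(mean2(1300.0, 1290.0), mean2(1357.5, 1357.0));
// npl256: mean 1784 (off) vs 1825.4 (cap64).
constexpr double gain_npl256 = pct_gain(mean2(1786.0, 1782.0), mean2(1826.3, 1824.5));

// Both land on the reported +4.8% / +2.3% after rounding to one decimal.
static_assert(gain_npl128 > 4.75 && gain_npl128 < 4.85, "npl128 gain ~ +4.8%");
static_assert(gain_npl256 > 2.25 && gain_npl256 < 2.35, "npl256 gain ~ +2.3%");
```
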
## Why 64 is the only value that helps net
A 512-token prefill ubatch routes ~32 tokens/expert on average (`512*8/128`).
cap16/cap32 force those into 16/32-wide tiles, which overflow into extra col-tiles
and weight re-reads, so prefill craters (-41% / -17%). cap64 still holds the
prefill density in one tile (32 < 64), so prefill is near-neutral (-1.3%), while
decode (~8 tokens/expert at npl128) gets the fuller, higher-occupancy tile.
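
The overflow argument comes down to a ceiling division per expert. The sketch below counts column-tiles for a given per-expert token count. `col_tiles` is a hypothetical name, and the 48-token case is an invented example of an expert that receives more than the 32-token average, not a measured value.

```cpp
// Hypothetical helper: number of mmq_x-wide column-tiles one expert needs.
// Each tile beyond the first implies another pass over that expert's weights.
constexpr int col_tiles(int tokens, int mmq_x) {
    return (tokens + mmq_x - 1) / mmq_x;
}

// Prefill, ~32 tokens/expert on average:
static_assert(col_tiles(32, 16) == 2, "cap16 always splits the average expert");
static_assert(col_tiles(32, 32) == 1, "cap32 fits only an exactly-average expert");
static_assert(col_tiles(48, 32) == 2, "an above-average expert spills under cap32");
static_assert(col_tiles(48, 64) == 1, "cap64 still has headroom");

// Decode at npl128, ~8 tokens/expert: one tile in every case.
static_assert(col_tiles(8, 64) == 1, "decode fits one 64-wide tile");
```

One plausible reading of the cap32 result is that routing is uneven, so with no headroom above the average a share of experts needs a second tile. That is an interpretation under the uniform-average assumption rather than something the measurements isolate.
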
## Verdict
- Real but **modest** high-effective-batch DECODE micro-optimization
(+4.8% npl128, +2.3% npl256), neutral at npl<=64, ~1.3% prefill cost at cap64.
- **Not** a cliff fix (no cliff) and **not** a real-server unlock (llama-server
continuous batching already scales). Shipped as an opt-in, default-off knob;
recommended value 64 for decode-heavy high-concurrency deployments.
- Correctness: greedy temp-0 server output with cap64 is byte-identical to stock
for single-stream generation and stays coherent; thousands of capped MoE
matmuls at npl128/256 ran with no CUDA error / NaN.
## Durable follow-up (scoped, not implemented)
Replace the blunt global cap with a density-aware auto-select: choose `mmq_x`
from `ne_get_rows / n_active_experts` inside `mul_mat_q_case` so decode gets the
small tile while prefill keeps its large tile automatically (removes the ~1.3%
prefill cost). Plus the block-padded `moe_align` in `mm_ids_helper`. See
`MOE_GROUPED_GEMM_SCOPE.md`.
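
As a rough illustration of what a density-aware limit could look like, the sketch below derives a loop upper bound from `ne_get_rows / n_active_experts`. It is not the scoped design. The function name, the `headroom` factor, and the `min_lim` floor are all hypothetical, chosen only to be consistent with the tables above (exact-density tiles such as cap16 regress decode, prefill prefers the full 128), and `ne_get_rows` is assumed to count routed (token, expert) rows.

```cpp
// Hypothetical helper: round up to the loop's step of 8.
constexpr int round_up_8(int v) { return ((v + 7) / 8) * 8; }

// Hypothetical density-aware upper bound for the mmq_x selection loop.
// headroom and min_lim are illustrative, unvalidated constants.
constexpr int density_aware_mmq_x_lim(int ne_get_rows, int n_active_experts, int mmq_x_max,
                                      int headroom = 4, int min_lim = 64) {
    if (ne_get_rows <= 0 || n_active_experts <= 0) {
        return mmq_x_max;  // no density information: fall back to stock
    }
    // Average routed rows per active expert, rounded up.
    const int density = (ne_get_rows + n_active_experts - 1) / n_active_experts;
    int lim = round_up_8(density * headroom);
    if (lim < min_lim)   lim = min_lim;    // avoid the too-small tiles that hurt decode
    if (lim > mmq_x_max) lim = mmq_x_max;  // never exceed the stock maximum
    return lim;
}

// Decode at npl128 (128 tokens * top-8 = 1024 rows, 128 experts): limit 64.
static_assert(density_aware_mmq_x_lim(1024, 128, 128) == 64, "decode gets the small tile");
// 512-token prefill ubatch (4096 rows): limit stays at 128.
static_assert(density_aware_mmq_x_lim(4096, 128, 128) == 128, "prefill keeps the large tile");
```

Because the result is only an upper bound for the existing loop, small batches would still stop at the first `mmq_x` that covers `ncols_max` in one tile.
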