feat(paged): mirror patch 0014 - expert-aware MoE token-tile cap

Mirror of the dev-tree engine patch (ggml mmq.cuh) into the paged patch set, plus its measurement writeup.

Adds LLAMA_MOE_MMQ_X, an opt-in env cap on the MoE grouped-GEMM token-tile (mmq_x) for the MUL_MAT_ID path; default-off = byte-identical to stock.

Honest result of the MoE near-term lever: the npl128 decode cliff does NOT exist on current HEAD. Stock decode is monotonic (85/282/629/935/1295/1779 t/s at npl 1/8/32/64/128/256); the old cliff was fixed upstream by the sorted grouped FP4-MMA GEMM + MoE stream-k.

The cap is therefore not a cliff fix but a modest high-batch decode micro-optimization: cap64 gives +4.8% decode at npl128 and +2.3% at npl256 (reproducible, neutral at npl<=64) for a ~1.3% prefill cost; cap16/cap32 are net-negative (prefill -41% / -17%).

Full tables in MOE_TOKEN_TILE_CAP.md; durable density-aware follow-up in MOE_GROUPED_GEMM_SCOPE.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 01:47:18 -04:00 · 2026-06-23 13:49:15 +00:00
parent 8925c009b7
commit 010067d900
2 changed files with 239 additions and 0 deletions
--- a/backend/cpp/llama-cpp/patches/paged/0014-paged-expert-aware-moe-token-tile-cap.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0014-paged-expert-aware-moe-token-tile-cap.patch
@@ -0,0 +1,140 @@
+From 652b858252b354f4d4fb49e5ed7468eeee8e32fc Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Tue, 23 Jun 2026 15:47:06 +0200
+Subject: [PATCH] feat(paged): expert-aware MoE token-tile cap (patch 0014)
+
+On GB10 (sm_121) the Qwen3-30B-A3B-class mxfp4 MoE decode path already uses the
+sorted grouped FP4-MMA GEMM (MUL_MAT_ID -> ggml_cuda_mul_mat_q ids branch:
+mm_ids_helper moe_align/scatter + one persistent stream-k mul_mat_q), so the
+originally reported npl128 throughput cliff does NOT reproduce on this build.
+llama-batched-bench decode (S_TG t/s) is monotonic across batch:
+
+  npl        1     8    32    64   128   256
+  S_TG     85   282   629   935  1295  1779   (stock, mxfp4 MoE, -fa on)
+
+There is no knee to erase; the old cliff (a real high-batch regression, 620 t/s
+at npl128) was fixed upstream by grouped-mmq + MoE stream-k load balancing.
+
+What remains is a pure tile-shape micro-inefficiency. In mul_mat_q_case the
+token-tile width mmq_x is chosen to cover ncols_max (= ne12, the per-expert
+column upper bound = token count, up to 128) in one column-tile. At MoE decode
+the per-expert token density is ~ne12*k/n_experts (top-8 of 128 => ~1/16 of
+ne12, e.g. ~8 tokens/expert at npl128), so each expert's single mmq_x-wide
+col-tile is only ~6% filled: the MMA accumulator tile is mmq_x-wide at compile
+time and burns throughput on the padding columns while the larger y-tile lowers
+occupancy. Stock picks the LARGEST tile (128), the inverse of vLLM's small
+per-expert BLOCK_SIZE_M, whereas the SMALLEST tile that still covers the
+density would raise fill + occupancy at no extra weight read: at tokens/expert
+<= mmq_x there is exactly one non-empty col-tile per expert, and the empty
+tiles are skipped by the jt*mmq_x >= col_diff guard in the stream-k kernel.
+
+Add LLAMA_MOE_MMQ_X: an env cap on mmq_x for the MUL_MAT_ID path only
+(expert_bounds != nullptr). Default (unset or <= 0) = disabled, so the mmq_x
+selection, and therefore every kernel launched, is byte-identical to stock. The
+cap only ever lowers the loop's upper bound and still selects from the same
+granularity- and shared-memory-validated mmq_x set stock already uses for
+smaller batches, so no new kernel configuration is exercised.
+
+Measured on GB10, qwen3coder-mxfp4.gguf, -fa on, -npp 128 -ntg 128, same binary,
+only LLAMA_MOE_MMQ_X differs (decode S_TG t/s / prefill S_PP t/s):
+
+  npl     stock S_TG   cap64 S_TG    d%     stock S_PP   cap64 S_PP
+   64        936          938      +0.1       2924         2883
+  128       1295         1357      +4.8       3075         3038
+  256       1784         1825      +2.3       3085         3046
+
+  (reproduced across interleaved reps; cap64 npl128 = 1357.5/1357.0, very stable)
+
+cap64 lifts high-batch decode +4.8% (npl128) / +2.3% (npl256), neutral at
+npl <= 64, for a consistent ~1.3% prefill cost. Smaller caps are net-negative:
+cap16 / cap32 crater prefill -41% / -17% (a 512-token prefill ubatch has ~32
+tokens/expert, which overflows a 16/32-wide tile into extra col-tiles + weight
+re-reads), so 64 is the recommended value and the only one that helps net.
+
+Honest framing: this is NOT a cliff fix (no cliff exists) and not a real-server
+throughput unlock (llama-server continuous batching already scales). It is a
+modest high-effective-batch DECODE micro-optimization that matches vLLM's
+smaller per-expert M-tiling, surfaced as an opt-in, default-off knob. The
+durable density-aware auto-select (drop the blunt global cap, choose mmq_x from
+ne_get_rows / n_active_experts so prefill keeps its large tile) is scoped in
+patches/paged/MOE_GROUPED_GEMM_SCOPE.md.
+
+Correctness: greedy temp-0 llama-server output with cap64 is byte-identical to
+stock for single-stream generation (fibonacci / capital-of-France / photosynthesis
+prompts) and stays coherent; batched-bench ran thousands of capped MoE matmuls at
+npl128/256 (mmq_x forced 128 -> 64) with no CUDA error / NaN and stable output.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ ggml/src/ggml-cuda/mmq.cuh | 37 ++++++++++++++++++++++++++++++++++++-
+ 1 file changed, 36 insertions(+), 1 deletion(-)
+
+diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh
+index edf546d..cff608e 100644
+--- a/ggml/src/ggml-cuda/mmq.cuh
+++ b/ggml/src/ggml-cuda/mmq.cuh
+@@ -6,6 +6,7 @@
+ 
+ #include <climits>
+ #include <cstdint>
+#include <cstdlib>
+ 
+ using namespace ggml_cuda_mma;
+ 
+@@ -4052,6 +4053,18 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a
+     }
+ }
+ 
+// [paged patch 0014] MoE token-tile (mmq_x) cap, read once from env LLAMA_MOE_MMQ_X.
+// Returns 0 when unset / non-positive => disabled (stock mmq_x selection, byte-identical).
+// On the MUL_MAT_ID grouped-GEMM path this caps the per-expert column-tile width toward the
+// low MoE-decode per-expert token density, raising tile fill + occupancy (see mul_mat_q_case).
+static inline int ggml_cuda_moe_mmq_x_cap() {
+    static const int cap = []() -> int {
+        const char * s = getenv("LLAMA_MOE_MMQ_X");
+        return s ? atoi(s) : 0;
+    }();
+    return cap;
+}
+
+ template <ggml_type type>
+ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
+     const int    id     = ggml_cuda_get_device();
+@@ -4063,10 +4076,32 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
+     const int mmq_x_max = get_mmq_x_max_host(cc);
+     const int mmq_y = get_mmq_y_host(cc);
+ 
+    // [paged patch 0014] expert-aware MoE token-tile (mmq_x) cap.
+    // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are
+    // tokens sorted by expert; stock picks mmq_x to cover ncols_max (= ne12, the token count,
+    // up to 128) in a single column-tile. At MoE decode the per-expert token density is low
+    // (top-k of many experts: ~ne12*k/n_experts tokens/expert, e.g. ~8 at npl128 for
+    // Qwen3-30B-A3B top-8/128), so each expert's single mmq_x-wide col-tile is mostly empty:
+    // the MMA accumulator tile is mmq_x-wide at compile time and wastes throughput on the
+    // padding columns while the larger y-tile lowers occupancy. Capping mmq_x toward the
+    // per-expert density raises tile fill + occupancy with no extra weight reads (at
+    // tokens/expert <= mmq_x there is still exactly one non-empty col-tile per expert; the
+    // empty tiles are skipped by the jt*mmq_x >= col_diff guard in the stream-k kernel).
+    // Default (env unset or <= 0) = disabled => mmq_x selection is byte-identical to stock;
+    // off the ids path the cap never applies.
+    int mmq_x_lim = mmq_x_max;
+    if (args.expert_bounds != nullptr) {
+        const int moe_cap = ggml_cuda_moe_mmq_x_cap();
+        if (moe_cap > 0) {
+            const int cap = moe_cap < 8 ? 8 : moe_cap;
+            mmq_x_lim = cap < mmq_x_max ? cap : mmq_x_max;
+        }
+    }
+
+     int mmq_x_best  = 0;
+     int ntiles_x_best = INT_MAX;
+ 
+-    for (int mmq_x = 8; mmq_x <= mmq_x_max && ntiles_x_best > 1; mmq_x += 8) {
+    for (int mmq_x = 8; mmq_x <= mmq_x_lim && ntiles_x_best > 1; mmq_x += 8) {
+         const int granularity = mmq_get_granularity_host(mmq_x, cc);
+ 
+         if (mmq_x % granularity != 0 || mmq_get_nbytes_shared<type>(mmq_x, mmq_y, cc, warp_size, nwarps) > smpbo) {
+-- 
+2.43.0
+
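The cap logic added by the patch above can be restated as a small standalone C++ sketch, useful for checking how a given LLAMA_MOE_MMQ_X value would bound the selection loop. The names `parse_cap` and `effective_mmq_x_limit` are hypothetical and exist only for this illustration; the patch itself inlines the clamp in `mul_mat_q_case` and caches the env read in `ggml_cuda_moe_mmq_x_cap`.

```cpp
#include <algorithm>
#include <cstdlib>

// Hypothetical helper: same parsing as the patch (atoi semantics, 0 when unset).
inline int parse_cap(const char * s) {
    return s ? std::atoi(s) : 0;
}

// Hypothetical helper mirroring the clamp in mul_mat_q_case: returns the upper
// bound used by the mmq_x selection loop.
//   ids_path  - true when args.expert_bounds != nullptr (MUL_MAT_ID path)
//   moe_cap   - parsed LLAMA_MOE_MMQ_X value
//   mmq_x_max - device maximum from get_mmq_x_max_host(cc)
inline int effective_mmq_x_limit(int moe_cap, int mmq_x_max, bool ids_path) {
    if (!ids_path || moe_cap <= 0) {
        return mmq_x_max;  // cap disabled: stock selection is untouched
    }
    // Raise tiny caps to the smallest tile (8), never exceed the device maximum.
    return std::min(std::max(moe_cap, 8), mmq_x_max);
}
```

Under these assumptions, a cap of 64 with a device maximum of 128 yields a loop bound of 64, a cap of 4 is raised to 8, and off the ids path the bound is always the device maximum. The selection loop still applies its granularity and shared-memory checks below that bound.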
--- a/backend/cpp/llama-cpp/patches/paged/MOE_TOKEN_TILE_CAP.md
+++ b/backend/cpp/llama-cpp/patches/paged/MOE_TOKEN_TILE_CAP.md
@@ -0,0 +1,99 @@
+# Patch 0014 findings: expert-aware MoE token-tile cap (LLAMA_MOE_MMQ_X)
+
+Near-term lever for the MoE-vs-vLLM workflow on GB10 (sm_121). Companion to
+`0014-paged-expert-aware-moe-token-tile-cap.patch`. Model:
+Qwen3-Coder-30B-A3B, 128 experts, top-8, mxfp4 experts
+(`~/bench/qwen3coder-mxfp4.gguf`). Dev tree `~/llama-paged-dev` (branch `paged`),
+`build-cuda` sm_121.
+
+## Headline (honest): there is no npl128 cliff to erase on this build
+
+The mission premise was a 25% decode drop at npl128 (batched-bench 253/505/830/620
+@ npl 8/32/64/128). It does **not** reproduce. Stock decode is monotonic:
+
+```
+llama-batched-bench, qwen3coder-mxfp4.gguf, -fa on, -npp 128 -ntg 128, S_TG t/s
+  npl        1     8    32    64   128   256
+  stock     85   282   629   935  1295  1779     <- monotonic, no knee
+```
+
+The old cliff was a real high-batch regression since fixed upstream: mxfp4 MoE
+decode on GB10 already takes the sorted grouped FP4-MMA GEMM (MUL_MAT_ID ->
+`ggml_cuda_mul_mat_q` ids branch: `mm_ids_helper` moe_align/scatter + one
+persistent stream-k `mul_mat_q`), i.e. vLLM's algorithm. See
+`MOE_GROUPED_GEMM_SCOPE.md`.
+
+## What the knob does
+
+`mul_mat_q_case` picks the token-tile width `mmq_x` to cover `ncols_max`
+(= `ne12`, the per-expert column upper bound = token count, up to 128) in one
+column-tile. At MoE decode the per-expert density is `~ne12*k/n_experts`
+(top-8/128 => ~1/16 of `ne12`), so each expert's `mmq_x`-wide col-tile is only
+~6% filled: the MMA accumulator tile is `mmq_x`-wide at compile time and wastes
+throughput on the padding columns, and the larger y-tile lowers occupancy.
+
+`LLAMA_MOE_MMQ_X=<n>` caps `mmq_x` on the MUL_MAT_ID path only
+(`expert_bounds != nullptr`). It only lowers the selection-loop upper bound and
+still chooses from the same granularity/shared-memory-validated `mmq_x` set stock
+already uses for smaller batches - no new kernel configuration. Default
+(unset/<=0) = disabled => byte-identical to stock.
+
+## Measurements (same binary, only LLAMA_MOE_MMQ_X differs)
+
+Decode throughput, S_TG t/s:
+
+```
+  npl     stock   cap16   cap32   cap64
+   1       85      85      85      85
+   8      282     280     282     282
+  32      629     623     629     628
+  64      935     915     949     934
+ 128     1295    1204    1344    1357     <- cap64 +4.8% (cap16 -7%)
+ 256     1779    1370    1723    1820     <- cap64 +2.3% (cap16 -23%)
+```
+
+Prefill throughput, S_PP t/s (the cost):
+
+```
+  npl     stock   cap16   cap32   cap64
+ 128     3083    1817    2559    3038
+ 256     3084    1818    2560    3046
+ vs stock       -41%    -17%    -1.3%
+```
+
+Reproducibility (interleaved off/cap64, two reps each):
+
+```
+  npl    off rep1/rep2   cap64 rep1/rep2
+  128    1300 / 1290     1357.5 / 1357.0
+  256    1786 / 1782     1826.3 / 1824.5
+```
+
+cap64 is stable to <0.1% and the gain sits well above the ~1% run-to-run band.
+
+## Why 64 is the only value that helps net
+
+A 512-token prefill ubatch routes ~32 tokens/expert (512*8/128). cap16/cap32
+force those into 16/32-wide tiles, overflowing into extra col-tiles + weight
+re-reads -> prefill craters (-41% / -17%). cap64 still holds the prefill density
+in one tile (32 < 64) so prefill is near-neutral (-1.3%), while decode (~8
+tokens/expert at npl128) gets the fuller, higher-occupancy tile.
+
+## Verdict
+
+- Real but **modest** high-effective-batch DECODE micro-optimization
+  (+4.8% npl128, +2.3% npl256), neutral at npl<=64, ~1.3% prefill cost at cap64.
+- **Not** a cliff fix (no cliff) and **not** a real-server unlock (llama-server
+  continuous batching already scales). Shipped as an opt-in, default-off knob;
+  recommended value 64 for decode-heavy high-concurrency deployments.
+- Correctness: greedy temp-0 server output with cap64 is byte-identical to stock
+  for single-stream generation and stays coherent; thousands of capped MoE
+  matmuls at npl128/256 ran with no CUDA error / NaN.
+
+## Durable follow-up (scoped, not implemented)
+
+Replace the blunt global cap with a density-aware auto-select: choose `mmq_x`
+from `ne_get_rows / n_active_experts` inside `mul_mat_q_case` so decode gets the
+small tile while prefill keeps its large tile automatically (removes the ~1.3%
+prefill cost). Plus the block-padded `moe_align` in `mm_ids_helper`. See
+`MOE_GROUPED_GEMM_SCOPE.md`.
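The density figures quoted in the writeup (~8 tokens/expert at npl128, ~32 for a 512-token prefill ubatch, ~6% tile fill) follow from simple arithmetic, sketched below in C++. The function names are hypothetical, and the model assumes perfectly uniform top-k routing, which real routers do not guarantee; experts that receive more than the average will need more column tiles than this estimate suggests.

```cpp
#include <algorithm>

// Average tokens routed to one expert, assuming uniform top-k routing
// (an idealization; real routing is uneven).
inline double avg_tokens_per_expert(int n_tokens, int top_k, int n_experts) {
    return static_cast<double>(n_tokens) * top_k / n_experts;
}

// Fraction of one mmq_x-wide column tile occupied by an expert's tokens.
inline double tile_fill(double tokens_per_expert, int mmq_x) {
    return std::min(1.0, tokens_per_expert / mmq_x);
}

// Number of mmq_x-wide column tiles needed to cover an expert's tokens.
inline int col_tiles(int tokens_per_expert, int mmq_x) {
    return (tokens_per_expert + mmq_x - 1) / mmq_x;
}
```

For the 128-expert, top-8 model, `avg_tokens_per_expert(128, 8, 128)` is 8, so `tile_fill(8, 128)` is 0.0625 with the stock 128-wide tile and 0.125 with cap64. For prefill, `avg_tokens_per_expert(512, 8, 128)` is 32, and `col_tiles(32, 16)` is 2 while `col_tiles(32, 64)` is 1, which matches the direction of the measured prefill cost. The averages alone do not explain every measurement (cap16 is also slower at decode), so treat this as an estimate rather than a tile-selection rule.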