From 010067d900f1c3f9582198970913a157a800a8ae Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Tue, 23 Jun 2026 13:49:15 +0000
Subject: [PATCH] feat(paged): mirror patch 0014 - expert-aware MoE token-tile
 cap

Mirror of the dev-tree engine patch (ggml mmq.cuh) into the paged patch set,
plus its measurement writeup. Adds LLAMA_MOE_MMQ_X, an opt-in env cap on the MoE
grouped-GEMM token-tile (mmq_x) for the MUL_MAT_ID path; default-off =
byte-identical to stock.

Honest result of the MoE near-term lever: the npl128 decode cliff does NOT exist
on current HEAD (stock decode is monotonic 85/282/629/935/1295/1779 t/s at npl
1/8/32/64/128/256; the old cliff was fixed upstream by the sorted grouped
FP4-MMA GEMM + MoE stream-k). The cap is therefore not a cliff fix but a modest
high-batch decode micro-optimization: cap64 gives +4.8% decode at npl128 and
+2.3% at npl256 (reproducible, neutral at npl<=64) for a ~1.3% prefill cost;
cap16/cap32 are net-negative (prefill -41% / -17%). Full tables in
MOE_TOKEN_TILE_CAP.md; durable density-aware follow-up in
MOE_GROUPED_GEMM_SCOPE.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
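
A minimal host-side sketch of the arithmetic behind the numbers above, assuming
tokens are routed uniformly across experts (real routers only approximate
this). The helper names (mmq_x_limit, tokens_per_expert, col_tiles_per_expert,
tile_fill) are hypothetical and do not exist in ggml; only the clamp in
mmq_x_limit mirrors what the patch does in mul_mat_q_case.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Upper bound for the mmq_x selection loop, mirroring the clamp in patch 0014:
// off the MoE path, or with cap <= 0, the stock bound is kept; otherwise the
// cap is raised to at least 8 and never exceeds mmq_x_max.
static int mmq_x_limit(int moe_cap, int mmq_x_max, bool moe_path) {
    if (!moe_path || moe_cap <= 0) {
        return mmq_x_max;
    }
    return std::min(std::max(moe_cap, 8), mmq_x_max);
}

// Average tokens per expert (ASSUMPTION: uniform routing).
static double tokens_per_expert(int n_tokens, int top_k, int n_experts) {
    assert(n_experts > 0);
    return double(n_tokens) * top_k / n_experts;
}

// Column tiles needed for one expert that received `tokens` tokens.
static int col_tiles_per_expert(double tokens, int mmq_x) {
    assert(mmq_x > 0);
    return std::max(1, (int) std::ceil(tokens / mmq_x));
}

// Fraction of the allocated tile columns that carry real tokens.
static double tile_fill(double tokens, int mmq_x) {
    return tokens / (double(col_tiles_per_expert(tokens, mmq_x)) * mmq_x);
}
```

For the top-8-of-128 shape, tokens_per_expert(128, 8, 128) is 8, so a 128-wide
tile is 6.25% full and a 64-wide tile 12.5% full. A 512-token ubatch gives 32
tokens/expert, which needs two 16-wide col-tiles but a single 64-wide one.
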
 ...aged-expert-aware-moe-token-tile-cap.patch | 140 ++++++++++++++++++
 .../patches/paged/MOE_TOKEN_TILE_CAP.md       |  99 +++++++++++++
 2 files changed, 239 insertions(+)
 create mode 100644 backend/cpp/llama-cpp/patches/paged/0014-paged-expert-aware-moe-token-tile-cap.patch
 create mode 100644 backend/cpp/llama-cpp/patches/paged/MOE_TOKEN_TILE_CAP.md

diff --git a/backend/cpp/llama-cpp/patches/paged/0014-paged-expert-aware-moe-token-tile-cap.patch b/backend/cpp/llama-cpp/patches/paged/0014-paged-expert-aware-moe-token-tile-cap.patch
new file mode 100644
index 000000000..fc9ff66b5
--- /dev/null
+++ b/backend/cpp/llama-cpp/patches/paged/0014-paged-expert-aware-moe-token-tile-cap.patch
@@ -0,0 +1,140 @@
+From 652b858252b354f4d4fb49e5ed7468eeee8e32fc Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Tue, 23 Jun 2026 15:47:06 +0200
+Subject: [PATCH] feat(paged): expert-aware MoE token-tile cap (patch 0014)
+
+On GB10 (sm_121) the Qwen3-30B-A3B-class mxfp4 MoE decode path already uses the
+sorted grouped FP4-MMA GEMM (MUL_MAT_ID -> ggml_cuda_mul_mat_q ids branch:
+mm_ids_helper moe_align/scatter + one persistent stream-k mul_mat_q), so the
+originally reported npl128 throughput cliff does NOT reproduce on this build.
+llama-batched-bench decode (S_TG t/s) is monotonic across batch:
+
+  npl        1     8    32    64   128   256
+  S_TG     85   282   629   935  1295  1779   (stock, mxfp4 MoE, -fa on)
+
+There is no knee to erase; the old cliff (a real high-batch regression, 620 t/s
+at npl128) was fixed upstream by grouped-mmq + MoE stream-k load balancing.
+
+What remains is a pure tile-shape micro-inefficiency. In mul_mat_q_case the
+token-tile width mmq_x is chosen to cover ncols_max (= ne12, the per-expert
+column upper bound = token count, up to 128) in one column-tile. At MoE decode
+the per-expert token density is ~ne12*k/n_experts (top-8 of 128 => ~1/16 of
+ne12, e.g. ~8 tokens/expert at npl128), so each expert's single mmq_x-wide
+col-tile is only ~6% filled: the MMA accumulator tile is mmq_x-wide at compile
+time and burns throughput on the padding columns while the larger y-tile lowers
+occupancy. Stock picks the LARGEST tile (128), whereas the SMALLEST tile that
+still covers the density would raise fill + occupancy at no extra weight read
+(while tokens/expert <= mmq_x there is exactly one non-empty col-tile per
+expert; empty tiles are skipped by the jt*mmq_x >= col_diff guard in the
+stream-k kernel). Stock is the inverse of vLLM's small per-expert BLOCK_SIZE_M.
+
+Add LLAMA_MOE_MMQ_X: an env cap on mmq_x for the MUL_MAT_ID path only
+(expert_bounds != nullptr). Default (unset or <= 0) = disabled, so the mmq_x
+selection, and therefore every kernel launched, is byte-identical to stock. The
+cap only ever lowers the loop's upper bound and still selects from the same
+granularity- and shared-memory-validated mmq_x set stock already uses for
+smaller batches, so no new kernel configuration is exercised.
+
+Measured on GB10, qwen3coder-mxfp4.gguf, -fa on, -npp 128 -ntg 128, same binary,
+only LLAMA_MOE_MMQ_X differs (decode S_TG t/s / prefill S_PP t/s):
+
+  npl     stock S_TG   cap64 S_TG  delta%   stock S_PP   cap64 S_PP
+   64        936          938      +0.1       2924         2883
+  128       1295         1357      +4.8       3075         3038
+  256       1784         1825      +2.3       3085         3046
+
+  (reproduced across interleaved reps; cap64 npl128 = 1357.5/1357.0, very stable)
+
+cap64 lifts high-batch decode +4.8% (npl128) / +2.3% (npl256), neutral at
+npl <= 64, for a consistent ~1.3% prefill cost. Smaller caps are net-negative:
+cap16 / cap32 crater prefill -41% / -17% (a 512-token prefill ubatch has ~32
+tokens/expert, which overflows a 16/32-wide tile into extra col-tiles + weight
+re-reads), so 64 is the recommended value and the only one that helps net.
+
+Honest framing: this is NOT a cliff fix (no cliff exists) and not a real-server
+throughput unlock (llama-server continuous batching already scales). It is a
+modest high-effective-batch DECODE micro-optimization that matches vLLM's
+smaller per-expert M-tiling, surfaced as an opt-in, default-off knob. The
+durable density-aware auto-select (drop the blunt global cap, choose mmq_x from
+ne_get_rows / n_active_experts so prefill keeps its large tile) is scoped in
+patches/paged/MOE_GROUPED_GEMM_SCOPE.md.
+
+Correctness: greedy temp-0 llama-server output with cap64 is byte-identical to
+stock for single-stream generation (fibonacci / capital-of-France / photosynthesis
+prompts) and stays coherent; batched-bench ran thousands of capped MoE matmuls at
+npl128/256 (mmq_x forced 128 -> 64) with no CUDA error / NaN and stable output.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ ggml/src/ggml-cuda/mmq.cuh | 37 ++++++++++++++++++++++++++++++++++++-
+ 1 file changed, 36 insertions(+), 1 deletion(-)
+
+diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh
+index edf546d..cff608e 100644
+--- a/ggml/src/ggml-cuda/mmq.cuh
++++ b/ggml/src/ggml-cuda/mmq.cuh
+@@ -6,6 +6,7 @@
+ 
+ #include <climits>
+ #include <cstdint>
++#include <cstdlib>
+ 
+ using namespace ggml_cuda_mma;
+ 
+@@ -4052,6 +4053,18 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a
+     }
+ }
+ 
++// [paged patch 0014] MoE token-tile (mmq_x) cap, read once from env LLAMA_MOE_MMQ_X.
++// Returns 0 when unset / non-positive => disabled (stock mmq_x selection, byte-identical).
++// On the MUL_MAT_ID grouped-GEMM path this caps the per-expert column-tile width toward the
++// low MoE-decode per-expert token density, raising tile fill + occupancy (see mul_mat_q_case).
++static inline int ggml_cuda_moe_mmq_x_cap() {
++    static const int cap = []() -> int {
++        const char * s = getenv("LLAMA_MOE_MMQ_X");
++        return s ? atoi(s) : 0;
++    }();
++    return cap;
++}
++
+ template <ggml_type type>
+ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
+     const int    id     = ggml_cuda_get_device();
+@@ -4063,10 +4076,32 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
+     const int mmq_x_max = get_mmq_x_max_host(cc);
+     const int mmq_y = get_mmq_y_host(cc);
+ 
++    // [paged patch 0014] expert-aware MoE token-tile (mmq_x) cap.
++    // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are
++    // tokens sorted by expert; stock picks mmq_x to cover ncols_max (= ne12, the token count,
++    // up to 128) in a single column-tile. At MoE decode the per-expert token density is low
++    // (top-k of many experts: ~ne12*k/n_experts tokens/expert, e.g. ~8 at npl128 for
++    // Qwen3-30B-A3B top-8/128), so each expert's single mmq_x-wide col-tile is mostly empty:
++    // the MMA accumulator tile is mmq_x-wide at compile time and wastes throughput on the
++    // padding columns while the larger y-tile lowers occupancy. Capping mmq_x toward the
++    // per-expert density raises tile fill + occupancy with no extra weight reads (at
++    // tokens/expert <= mmq_x there is still exactly one non-empty col-tile per expert; the
++    // empty tiles are skipped by the jt*mmq_x >= col_diff guard in the stream-k kernel).
++    // Default (env unset or <= 0) = disabled => mmq_x selection is byte-identical to stock;
++    // off the ids path the cap never applies.
++    int mmq_x_lim = mmq_x_max;
++    if (args.expert_bounds != nullptr) {
++        const int moe_cap = ggml_cuda_moe_mmq_x_cap();
++        if (moe_cap > 0) {
++            const int cap = moe_cap < 8 ? 8 : moe_cap;
++            mmq_x_lim = cap < mmq_x_max ? cap : mmq_x_max;
++        }
++    }
++
+     int mmq_x_best  = 0;
+     int ntiles_x_best = INT_MAX;
+ 
+-    for (int mmq_x = 8; mmq_x <= mmq_x_max && ntiles_x_best > 1; mmq_x += 8) {
++    for (int mmq_x = 8; mmq_x <= mmq_x_lim && ntiles_x_best > 1; mmq_x += 8) {
+         const int granularity = mmq_get_granularity_host(mmq_x, cc);
+ 
+         if (mmq_x % granularity != 0 || mmq_get_nbytes_shared<type>(mmq_x, mmq_y, cc, warp_size, nwarps) > smpbo) {
+-- 
+2.43.0
+
diff --git a/backend/cpp/llama-cpp/patches/paged/MOE_TOKEN_TILE_CAP.md b/backend/cpp/llama-cpp/patches/paged/MOE_TOKEN_TILE_CAP.md
new file mode 100644
index 000000000..88602291d
--- /dev/null
+++ b/backend/cpp/llama-cpp/patches/paged/MOE_TOKEN_TILE_CAP.md
@@ -0,0 +1,99 @@
+# Patch 0014 findings: expert-aware MoE token-tile cap (LLAMA_MOE_MMQ_X)
+
+Near-term lever for the MoE-vs-vLLM workflow on GB10 (sm_121). Companion to
+`0014-paged-expert-aware-moe-token-tile-cap.patch`. Model:
+Qwen3-Coder-30B-A3B, 128 experts, top-8, mxfp4 experts
+(`~/bench/qwen3coder-mxfp4.gguf`). Dev tree `~/llama-paged-dev` (branch `paged`),
+`build-cuda` sm_121.
+
+## Headline (honest): there is no npl128 cliff to erase on this build
+
+The mission premise was a 25% decode drop at npl128 (batched-bench 253/505/830/620
+@ npl 8/32/64/128). It does **not** reproduce. Stock decode is monotonic:
+
+```
+llama-batched-bench, qwen3coder-mxfp4.gguf, -fa on, -npp 128 -ntg 128, S_TG t/s
+  npl        1     8    32    64   128   256
+  stock     85   282   629   935  1295  1779     <- monotonic, no knee
+```
+
+The old cliff was a real high-batch regression that has since been fixed
+upstream: mxfp4 MoE decode on GB10 already takes the sorted grouped FP4-MMA GEMM
+(MUL_MAT_ID -> `ggml_cuda_mul_mat_q` ids branch: `mm_ids_helper`
+moe_align/scatter + one persistent stream-k `mul_mat_q`), i.e. the same
+sort-then-grouped-GEMM scheme vLLM uses. See `MOE_GROUPED_GEMM_SCOPE.md`.
+
+## What the knob does
+
+`mul_mat_q_case` picks the token-tile width `mmq_x` to cover `ncols_max`
+(= `ne12`, the per-expert column upper bound = token count, up to 128) in one
+column-tile. At MoE decode the per-expert density is `~ne12*k/n_experts`
+(top-8/128 => ~1/16 of `ne12`), so each expert's `mmq_x`-wide col-tile is only
+~6% filled: the MMA accumulator tile is `mmq_x`-wide at compile time and wastes
+throughput on the padding columns, and the larger y-tile lowers occupancy.
+
+`LLAMA_MOE_MMQ_X=<n>` caps `mmq_x` on the MUL_MAT_ID path only
+(`expert_bounds != nullptr`). `n` is clamped to `[8, mmq_x_max]` and only lowers
+the selection-loop upper bound (step 8), so `mmq_x` still comes from the same
+granularity/shared-memory-validated set stock already uses for smaller batches:
+no new kernel configuration. Unset or <=0 = disabled => byte-identical to stock.
+
+## Measurements (same binary, only LLAMA_MOE_MMQ_X differs)
+
+Decode throughput, S_TG t/s:
+
+```
+  npl     stock   cap16   cap32   cap64
+   1       85      85      85      85
+   8      282     280     282     282
+  32      629     623     629     628
+  64      935     915     949     934
+ 128     1295    1204    1344    1357     <- cap64 +4.8% (cap16 -7%)
+ 256     1779    1370    1723    1820     <- cap64 +2.3% (cap16 -23%)
+```
+
+Prefill throughput, S_PP t/s (the cost):
+
+```
+  npl     stock   cap16   cap32   cap64
+ 128     3083    1817    2559    3038
+ 256     3084    1818    2560    3046
+ vs stock        -41%    -17%    -1.3%
+```
+
+Reproducibility (interleaved off/cap64, two reps each):
+
+```
+  npl    off rep1/rep2   cap64 rep1/rep2
+  128    1300 / 1290     1357.5 / 1357.0
+  256    1786 / 1782     1826.3 / 1824.5
+```
+
+cap64 is stable to <0.1% and the gain sits well above the ~1% run-to-run band.
+
+## Why 64 is the only value that helps net
+
+A 512-token prefill ubatch routes ~32 tokens/expert. cap16/cap32 force those into
+16/32-wide tiles, overflowing into extra col-tiles + weight re-reads -> prefill
+craters (-41% / -17%). cap64 still holds the prefill density in one tile (32 < 64)
+so prefill is near-neutral (-1.3%), while decode (~8 tokens/expert at npl128) gets
+the fuller, higher-occupancy tile.
+
+## Verdict
+
+- Real but **modest** high-effective-batch DECODE micro-optimization
+  (+4.8% npl128, +2.3% npl256), neutral at npl<=64, ~1.3% prefill cost at cap64.
+- **Not** a cliff fix (no cliff) and **not** a real-server unlock (llama-server
+  continuous batching already scales). Shipped as an opt-in, default-off knob;
+  recommended value 64 for decode-heavy high-concurrency deployments.
+- Correctness: greedy temp-0 server output with cap64 is byte-identical to stock
+  for single-stream generation and stays coherent; thousands of capped MoE
+  matmuls at npl128/256 ran with no CUDA error / NaN.
+
+## Durable follow-up (scoped, not implemented)
+
+Replace the blunt global cap with a density-aware auto-select: choose `mmq_x`
+from `ne_get_rows / n_active_experts` inside `mul_mat_q_case` so decode gets the
+small tile while prefill keeps its large tile automatically (removes the ~1.3%
+prefill cost). The follow-up also covers the block-padded `moe_align` in
+`mm_ids_helper`. See `MOE_GROUPED_GEMM_SCOPE.md`.