feat(paged): mirror MoE token-tile density-aware auto-select (patch 0015)

Mirror of llama-paged-dev commit 151343b into the pinned paged patch series.

The durable, default-on follow-up to patch 0014's opt-in LLAMA_MOE_MMQ_X global cap: a host-side density-aware mmq_x auto-select in mul_mat_q_case that caps the MUL_MAT_ID grouped FP4-MMA token-tile only at low per-expert density (decode) and keeps the 128 tile at high density (prefill), so it is prefill-safe by construction (removes 0014's ~1.3% prefill cost). No new kernel.

density_max default = 8 (not tile/4 = 16): 16 equals the 256-expert prefill-ubatch density and regressed S_PP ~2% on Qwen3.6-35B-A3B NVFP4; 8 sits between decode and prefill density for n_experts in [128,511] at n_ubatch=512.

Honest result on the mission's MoE target (Qwen3.6-35B-A3B NVFP4, 256 experts + GDN/SSM linear attention, GB10 sm_121, median of 5 reps): NEUTRAL. Decode S_TG is within run-to-run noise (npl128 +0.36%) and prefill S_PP neutral (within +/-0.7%). This model is bound by the SSM recurrence and 256-tiny-expert weight bandwidth, not the MoE col-tile occupancy, so the col-tile lever has nothing to bite on; a npl128 tile sweep confirms 64 is the only useful width (TILE8 -6.3% ... TILE96 -0.8%). The lever's real win lives on col-tile-bound MoE (Qwen3-Coder-30B, +4.8% @npl128 per patch 0014), which the auto-select reproduces at npl128 by construction at zero prefill cost.

Shipped default-on because it is prefill-safe, decode-neutral here, and correctness-gated. LLAMA_MOE_MMQ_X (0014) kept as a manual override; LLAMA_MOE_AUTO_TILE=0 restores exact stock selection.

P0 gate: test-backend-ops test_mul_mat_id ragged small-M NVFP4/MXFP4 MoE decode-density shapes pass CUDA-vs-CPU on GB10 both default-on and stock.

Full rationale and tables in patches/paged/MOE_DENSITY_AUTO_TILE.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-23 16:19:07 -04:00 · 2026-06-23 19:04:55 +00:00
parent 010067d900
commit acb22a66ed
2 changed files with 381 additions and 0 deletions
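The selection rule summarized in the commit message can be sketched as a standalone host-side function. This is a minimal illustration, not the code in the patch: the name `moe_mmq_x_limit` and its defaulted parameters are hypothetical, with the defaults (128, 64, 8) taken from the values quoted in the message.

```cpp
#include <cstdint>

// Hypothetical standalone helper (not part of the patch) mirroring the described rule.
// Returns the upper bound for the mmq_x search on the MUL_MAT_ID path.
//   ne_get_rows : token-expert assignments (ne12 * n_expert_used)
//   n_experts   : number of experts (ne02)
//   mmq_x_max   : largest tile for the device (128 in the text)
//   tile        : small decode tile (default 64 in the text)
//   density_max : per-expert density ceiling (default 8 in the text)
inline int moe_mmq_x_limit(int64_t ne_get_rows, int64_t n_experts,
                           int mmq_x_max = 128, int tile = 64, int density_max = 8) {
    if (ne_get_rows <= 0 || n_experts <= 0) {
        return mmq_x_max;  // nothing to estimate: keep the stock limit
    }
    // Upper bound on how many experts can receive at least one token.
    const int64_t n_active = ne_get_rows < n_experts ? ne_get_rows : n_experts;
    // Ceiling division: average tokens per active expert.
    const int64_t density = (ne_get_rows + n_active - 1) / n_active;
    // Lower the limit only at low density, and only if the small tile is smaller.
    return (density <= density_max && tile < mmq_x_max) ? tile : mmq_x_max;
}
```

With these defaults, a 256-expert decode step at npl128 (1024 assignments, density 4) gets a limit of 64, while a 512-token prefill ubatch on the same model (4096 assignments, density 16) keeps 128.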
--- a/backend/cpp/llama-cpp/patches/paged/0015-paged-expert-density-aware-moe-token-tile-auto-select.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0015-paged-expert-density-aware-moe-token-tile-auto-select.patch
@@ -0,0 +1,238 @@
+From 151343bc8c7b956c99eafc855704b70d44637a3b Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Tue, 23 Jun 2026 21:03:00 +0200
+Subject: [PATCH] feat(paged): expert-density-aware MoE token-tile auto-select
+ (patch 0015)
+
+The durable follow-up to patch 0014's blunt LLAMA_MOE_MMQ_X global cap (which the
+0014 doc itself scoped): replace the manual env cap with a host-side, default-on
+auto-select inside mul_mat_q_case that picks a small token-tile (mmq_x) for the
+MUL_MAT_ID grouped FP4-MMA GEMM only when the per-expert token density is low
+(decode), and keeps the large 128-wide tile when density is high (prefill). No new
+kernel: the selection only lowers the loop's upper bound to an already-compiled,
+granularity- and shared-memory-validated mmq_x.
+
+Density is estimated host-side from the args the ids path already passes:
+  ne_get_rows = ncols_dst   = ne12 * n_expert_used   (token-expert assignments)
+  n_experts   = nchannels_x = ne02
+  density     = ceil(ne_get_rows / min(ne_get_rows, n_experts))   (tokens/expert)
+Cap to the small tile (default 64) only when density <= density_max. Unlike 0014's
+global cap, the high-density prefill ubatch stays on the big tile, so S_PP does not
+regress by construction.
+
+density_max default = 8 (not tile/4 = 16). The cap must fire for decode but not for
+a prefill ubatch, and each has per-expert density n_tokens*n_used/n_experts. At the
+standard n_ubatch=512, n_used=8: prefill density = 4096/n_experts (32 at 128 experts,
+16 at 256), decode at npl<=128 is <= 1024/n_experts (8 at 128, 4 at 256). Default 8
+sits strictly between for every n_experts in [128,511], so it caps decode and leaves
+prefill on the big tile. tile/4 (=16) equalled the 256-expert prefill density and
+cratered its S_PP by ~2%, the regression this threshold exists to avoid.
+
+Measured on GB10 (sm_121), Qwen3.6-35B-A3B NVFP4 (256 experts, top-8, GDN linear
+attention), llama-batched-bench -fa on -npp 128 -ntg 128, default-on vs stock
+(LLAMA_MOE_AUTO_TILE=0), median of 5 reps:
+
+  npl   S_TG stock  S_TG 0015   dTG%    S_PP stock  S_PP 0015   dPP%
+    8      183.59     183.18  -0.22%       1489.2     1500.1  +0.73%
+   32      264.02     263.44  -0.22%       2034.5     2033.5  -0.05%
+   64      311.76     310.41  -0.43%       2028.3     2027.6  -0.03%
+  128      336.10     337.32  +0.36%       2025.0     2027.7  +0.13%
+
+Honest read: on THIS model the decode effect is within run-to-run noise (neutral)
+and prefill is neutral. q36-35b-a3b decode is bound by the GDN/SSM recurrence and
+256 tiny-expert weight bandwidth, not the MoE col-tile occupancy, so the col-tile
+lever (worth +4.8% @npl128 on Qwen3-Coder-30B, 128 larger experts, patch 0014
+cap64) does not move it. A npl128 tile sweep on this model confirms 64 is the only
+useful width (TILE8 -6.3%, TILE16 -3.2%, TILE32 -0.2%, TILE64 +0.7%, TILE96 -0.8%):
+smaller tiles lose to grid/scheduling overhead and the FP4-MMA minimum width.
+
+Value banked default-on: (1) removes 0014's ~1.3% prefill cost by construction
+(density-gated, not global); (2) auto-selects the small tile for col-tile-bound MoE
+decode, reproducing 0014 cap64's tile=64 at npl128 by construction, so it preserves
+the +4.8% on Qwen3-Coder-30B without the prefill cost; (3) prefill-safe and decode-
+neutral on the SSM model, harmless where it does not help. Conservative by design:
+at npl256 the qwen3coder decode density (16) equals the 256-expert prefill density
+(16), indistinguishable to a pure-density gate, so density_max=8 forgoes 0014's
+2.3% @npl256 to keep 256-expert prefill safe; an ne12-aware refinement is future
+work.
+
+LLAMA_MOE_MMQ_X (patch 0014) is KEPT as a manual override that, when > 0, forces the
+old blunt global cap and bypasses the auto-select (explicit A/B knob). The auto-
+select is the default; LLAMA_MOE_AUTO_TILE=0 restores exact stock mmq_x selection.
+LLAMA_MOE_DECODE_TILE / LLAMA_MOE_DENSITY_MAX tune the small tile / threshold.
+
+Correctness: extends tests/test-backend-ops test_mul_mat_id with a ragged small-M
+NVFP4/MXFP4 MoE decode-density gate (128 experts, top-8, m=768, k=2048, n in
+{16,33,64,128,130,200,256,512} spanning the cap boundary and ragged token counts).
+All 16 shapes pass CUDA-vs-CPU oracle on GB10 both default-on and with
+LLAMA_MOE_AUTO_TILE=0; full MUL_MAT_ID suite 2/2 backends OK. Off the ids path
+nothing changes (non-MoE mul_mat byte-identical to stock).
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ ggml/src/ggml-cuda/mmq.cuh | 100 ++++++++++++++++++++++++++++++-------
+ tests/test-backend-ops.cpp |  16 ++++++
+ 2 files changed, 99 insertions(+), 17 deletions(-)
+
+diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh
+index cff608e..9718b12 100644
+--- a/ggml/src/ggml-cuda/mmq.cuh
+++ b/ggml/src/ggml-cuda/mmq.cuh
+@@ -4053,10 +4053,11 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a
+     }
+ }
+ 
+-// [paged patch 0014] MoE token-tile (mmq_x) cap, read once from env LLAMA_MOE_MMQ_X.
+-// Returns 0 when unset / non-positive => disabled (stock mmq_x selection, byte-identical).
+-// On the MUL_MAT_ID grouped-GEMM path this caps the per-expert column-tile width toward the
+-// low MoE-decode per-expert token density, raising tile fill + occupancy (see mul_mat_q_case).
+// [paged patch 0014] MoE token-tile (mmq_x) MANUAL cap, read once from env LLAMA_MOE_MMQ_X.
+// Returns 0 when unset / non-positive => disabled (fall through to the patch-0015 auto-select).
+// When > 0 it forces a blunt GLOBAL cap on the per-expert column-tile width for the MUL_MAT_ID
+// grouped-GEMM path (decode AND prefill), overriding the density-aware auto-select below. Kept
+// as an explicit override / A-B knob; the default path is now the auto-select.
+ static inline int ggml_cuda_moe_mmq_x_cap() {
+     static const int cap = []() -> int {
+         const char * s = getenv("LLAMA_MOE_MMQ_X");
+@@ -4065,6 +4066,43 @@ static inline int ggml_cuda_moe_mmq_x_cap() {
+     return cap;
+ }
+ 
+// [paged patch 0015] expert-density-aware MoE token-tile (mmq_x) auto-select knobs (DEFAULT-ON).
+// LLAMA_MOE_AUTO_TILE=0 disables the auto-select => exact stock mmq_x selection.
+static inline bool ggml_cuda_moe_auto_tile_enabled() {
+    static const bool en = []() -> bool {
+        const char * s = getenv("LLAMA_MOE_AUTO_TILE");
+        return !(s && atoi(s) == 0);
+    }();
+    return en;
+}
+// The small high-occupancy token-tile chosen for low-density (decode) MoE matmuls. Default 64:
+// the measured GB10 sweet spot (full per-expert fill with >=4x routing-imbalance headroom).
+static inline int ggml_cuda_moe_decode_tile() {
+    static const int t = []() -> int {
+        const char * s = getenv("LLAMA_MOE_DECODE_TILE");
+        const int v = s ? atoi(s) : 0;
+        return v >= 8 ? v : 64;
+    }();
+    return t;
+}
+// Per-expert token-density ceiling under which the small tile is selected. Default 8: the cap must
+// fire for decode but NOT for a prefill ubatch, and the per-expert density of each is
+// n_tokens*n_used/n_experts. For the standard n_ubatch=512, n_used=8 the prefill density is
+// 4096/n_experts (= 32 at 128 experts, 16 at 256 experts); decode at npl<=128 is <=1024/n_experts
+// (= 8 at 128 experts, 4 at 256). Default 8 sits strictly between the two for every n_experts in
+// [128,511], so it caps decode and leaves the prefill ubatch on the big 128 tile - whereas the old
+// tile/4 (=16) equalled the 256-expert prefill density and cratered its S_PP by ~2% (measured on
+// Qwen3.6-35B-A3B NVFP4). 8 also keeps >=8x fill headroom at tile 64 so an imbalanced expert
+// segment never splits into an extra col-tile.
+static inline int ggml_cuda_moe_density_max() {
+    static const int d = []() -> int {
+        const char * s = getenv("LLAMA_MOE_DENSITY_MAX");
+        const int v = s ? atoi(s) : 0;
+        return v > 0 ? v : 8;
+    }();
+    return d;
+}
+
+ template <ggml_type type>
+ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
+     const int    id     = ggml_cuda_get_device();
+@@ -4076,25 +4114,53 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
+     const int mmq_x_max = get_mmq_x_max_host(cc);
+     const int mmq_y = get_mmq_y_host(cc);
+ 
+-    // [paged patch 0014] expert-aware MoE token-tile (mmq_x) cap.
+-    // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are
+-    // tokens sorted by expert; stock picks mmq_x to cover ncols_max (= ne12, the token count,
+-    // up to 128) in a single column-tile. At MoE decode the per-expert token density is low
+-    // (top-k of many experts: ~ne12*k/n_experts tokens/expert, e.g. ~8 at npl128 for
+-    // Qwen3-30B-A3B top-8/128), so each expert's single mmq_x-wide col-tile is mostly empty:
+-    // the MMA accumulator tile is mmq_x-wide at compile time and wastes throughput on the
+-    // padding columns while the larger y-tile lowers occupancy. Capping mmq_x toward the
+-    // per-expert density raises tile fill + occupancy with no extra weight reads (at
+-    // tokens/expert <= mmq_x there is still exactly one non-empty col-tile per expert; the
+-    // emptier tiles are skipped by the jt*mmq_x >= col_diff guard in the stream-k kernel).
+-    // Default (env unset or <= 0) = disabled => mmq_x selection is byte-identical to stock;
+-    // off the ids path the cap never applies.
+    // [paged patch 0015] expert-density-aware MoE token-tile (mmq_x) auto-select (DEFAULT-ON).
+    // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are tokens
+    // sorted by expert; stock picks mmq_x to cover ncols_max (= ne12, the token count, up to 128)
+    // in a single column-tile, i.e. it MAXIMIZES the tile (128 on Blackwell) for the aggregate
+    // batch. But the tile is then applied PER EXPERT, and at MoE decode the per-expert token
+    // density is tiny (top-k of many experts), so each expert's single 128-wide col-tile is mostly
+    // empty: the MMA accumulator tile is mmq_x-wide at compile time and burns throughput on the
+    // padding columns while the larger y-tile lowers occupancy. vLLM's fused-MoE does the opposite
+    // (a small per-expert BLOCK_SIZE_M). We reproduce that here, host-side only, by picking a
+    // SMALLER mmq_x when - and only when - the per-expert density is low:
+    //
+    //   ne_get_rows  = args.ncols_dst    = ne12 * n_expert_used  (total token-expert assignments)
+    //   n_experts    = args.nchannels_x  = ne02
+    //   n_active_est = min(n_experts, ne_get_rows)               (upper bound on active experts)
+    //   density      = ceil(ne_get_rows / n_active_est)          (avg tokens per active expert)
+    //
+    // Cap to the small tile (default 64) only when density <= density_max (default 8). 8 sits below
+    // every prefill-ubatch density and above every decode density for n_experts in [128,511] at the
+    // standard n_ubatch=512 (prefill 4096/n_experts, decode <=1024/n_experts), with >=8x fill headroom
+    // so a capped expert segment never splits a col-tile. Decode (per-expert density 4 at 256 experts,
+    // 8 at 128 experts @npl128) gets the fuller high-occupancy tile; the prefill ubatch (density 16 at
+    // 256 / 32 at 128 experts) stays ABOVE the threshold and keeps the big
+    // 128 compute tile - so unlike the blunt global cap (LLAMA_MOE_MMQ_X / patch 0014) this is
+    // prefill-safe by construction. The selection only ever picks an already-compiled, granularity-
+    // and shared-memory-validated mmq_x that the loop below would consider for a smaller batch; no
+    // new kernel. Off the ids path (expert_bounds == nullptr) nothing changes => non-MoE mul_mat
+    // and the gated f16/bf16 host-loop fallback stay byte-identical to stock.
+    //   - LLAMA_MOE_MMQ_X=<n>   : manual blunt global cap, overrides the auto-select (patch 0014).
+    //   - LLAMA_MOE_AUTO_TILE=0 : disable the auto-select (exact stock selection).
+    //   - LLAMA_MOE_DECODE_TILE=<n>, LLAMA_MOE_DENSITY_MAX=<n> : tune the tile / threshold.
+     int mmq_x_lim = mmq_x_max;
+     if (args.expert_bounds != nullptr) {
+         const int moe_cap = ggml_cuda_moe_mmq_x_cap();
+         if (moe_cap > 0) {
+             const int cap = moe_cap < 8 ? 8 : moe_cap;
+             mmq_x_lim = cap < mmq_x_max ? cap : mmq_x_max;
+        } else if (ggml_cuda_moe_auto_tile_enabled()) {
+            const int64_t ne_get_rows = args.ncols_dst;
+            const int64_t n_experts   = args.nchannels_x;
+            if (ne_get_rows > 0 && n_experts > 0) {
+                const int64_t n_active = ne_get_rows < n_experts ? ne_get_rows : n_experts;
+                const int64_t density  = (ne_get_rows + n_active - 1) / n_active;
+                const int     tile     = ggml_cuda_moe_decode_tile();
+                if (density <= (int64_t) ggml_cuda_moe_density_max() && tile < mmq_x_max) {
+                    mmq_x_lim = tile;
+                }
+            }
+         }
+     }
+ 
+diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
+index 15ae389..f219309 100644
+--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
+@@ -8575,6 +8575,22 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
+     // gpt-oss issue with Vulkan mmq_id
+     test_cases.emplace_back(new test_mul_mat_id(GGML_TYPE_MXFP4, GGML_TYPE_F32, 32, 2, false, 2880, 32, 2880));
+ 
+    // [paged P0] MXFP4/NVFP4 qwen3-30b-a3b MoE decode-density regression gate for the expert-
+    // density-aware mmq_x auto-select (patch 0015). Real expert-FFN slice (128 experts, top-8,
+    // m=768, k=2048) so this exercises the exact grouped FP4-MMA mmq kernel the model runs.
+    // Per-expert token density = n*n_used/n_mats = n/16; cover the decode band (density 1/4/8/16
+    // at n 16/64/128/256), ragged token counts (n 33/130/200: experts with 0/1/2 tokens, n not a
+    // multiple of the tile) where the tiny-M col-tiles change geometry and any masking can leak,
+    // and a prefill-density shape (n 512 => density 32) the auto-select must leave on the large
+    // 128 tile. n=128 is where stock picks mmq_x=128 and the default auto-select picks 64, so the
+    // op-test (CPU oracle vs CUDA, deterministic) is the bit-exact regression gate for P1: it must
+    // pass with the auto-select on (default) and with LLAMA_MOE_AUTO_TILE=0 (stock selection).
+    for (ggml_type type_a : {GGML_TYPE_MXFP4, GGML_TYPE_NVFP4}) {
+        for (int n : {16, 33, 64, 128, 130, 200, 256, 512}) {
+            test_cases.emplace_back(new test_mul_mat_id(type_a, GGML_TYPE_F32, 128, 8, false, 768, n, 2048));
+        }
+    }
+
+     for (ggml_type type_a : all_types) {
+         test_cases.emplace_back(new test_mul_mat_id(type_a, GGML_TYPE_F32, 4, 2, false, 64, 16, 3*ggml_blck_size(type_a)));
+     }
+-- 
+2.43.0
+
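The claim in the patch description that `density_max = 8` separates decode from prefill for every `n_experts` in [128,511] is plain integer arithmetic and can be checked directly. The sketch below assumes the same workload the description uses (n_ubatch=512, npl=128, n_used=8); the helper names `density`, `separates`, and `separates_all` are hypothetical and exist only for this check.

```cpp
#include <cstdint>

// Per-expert density as the patch description estimates it:
// ceil(rows / min(rows, n_experts)) with rows = n_tokens * n_used.
inline int64_t density(int64_t n_tokens, int64_t n_used, int64_t n_experts) {
    const int64_t rows   = n_tokens * n_used;
    const int64_t active = rows < n_experts ? rows : n_experts;
    return (rows + active - 1) / active;
}

// True when the threshold caps decode (npl tokens) but not a prefill ubatch.
// The defaults are the workload assumed in the description, not universal values.
inline bool separates(int64_t n_experts, int64_t density_max,
                      int64_t n_ubatch = 512, int64_t npl = 128, int64_t n_used = 8) {
    return density(npl, n_used, n_experts) <= density_max &&
           density(n_ubatch, n_used, n_experts) > density_max;
}

// Checks every expert count in the inclusive range [lo, hi].
inline bool separates_all(int64_t lo, int64_t hi, int64_t density_max) {
    for (int64_t n = lo; n <= hi; ++n) {
        if (!separates(n, density_max)) {
            return false;
        }
    }
    return true;
}
```

Under these assumptions the range is tight at both ends: at 127 experts the decode density rounds up to 9 and is no longer capped, and at 512 experts the prefill density drops to 8 and would be capped. A threshold of 16 fails at 256 experts because the prefill density is exactly 16.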
--- a/backend/cpp/llama-cpp/patches/paged/MOE_DENSITY_AUTO_TILE.md
+++ b/backend/cpp/llama-cpp/patches/paged/MOE_DENSITY_AUTO_TILE.md
@@ -0,0 +1,143 @@
+# Patch 0015 findings: expert-density-aware MoE token-tile auto-select
+
+The durable follow-up to patch 0014 (`MOE_TOKEN_TILE_CAP.md`): replace the blunt,
+opt-in `LLAMA_MOE_MMQ_X` global cap with a host-side, **default-on** density-aware
+`mmq_x` auto-select in `mul_mat_q_case`. Companion to
+`0015-paged-expert-density-aware-moe-token-tile-auto-select.patch`. Dev tree
+`~/llama-paged-dev` (branch `paged`), `build-cuda` sm_121.
+
+Primary model: **Qwen3.6-35B-A3B NVFP4** (`~/bench/q36-35b-a3b-nvfp4.gguf`),
+**256 experts, top-8**, expert FFN 512, GDN linear attention (SSM inner 4096),
+41 layers. This is a different beast from 0014's Qwen3-Coder-30B-A3B (128 experts,
+larger expert FFN, standard attention).
+
+## What it does (vs 0014)
+
+Stock `mul_mat_q_case` picks the token-tile width `mmq_x` to cover `ncols_max` (= `ne12`,
+the token count, which is also the upper bound on any one expert's columns) in a single
+column-tile, i.e. it **maximizes** the tile (128 on Blackwell). Applied per expert at MoE
+decode, where per-expert density is tiny, that 128-wide tile is mostly padding.
+
+Patch 0014 capped `mmq_x` globally on the ids path via `LLAMA_MOE_MMQ_X` (decode
+**and** prefill), which cost ~1.3% prefill. Patch 0015 instead estimates the
+per-expert density host-side, from args the ids path already passes:
+
+```
+ne_get_rows = ncols_dst   = ne12 * n_expert_used        (token-expert assignments)
+n_experts   = nchannels_x = ne02
+density     = ceil(ne_get_rows / min(ne_get_rows, n_experts))   (tokens/expert)
+```
+
+and caps to the small tile (default 64) **only when `density <= density_max`**, so
+the high-density prefill ubatch keeps the big 128 tile. Prefill-safe by construction.
+No new kernel: the selection only lowers the loop's upper bound to an
+already-compiled, granularity- and shared-memory-validated `mmq_x`.
+
+## The threshold matters: `density_max = 8`, not `tile/4 = 16`
+
+The cap must fire for decode but not for a prefill ubatch. Each has per-expert
+density `n_tokens * n_used / n_experts`. At the standard `n_ubatch=512`, `n_used=8`:
+
+```
+workload (n_tokens)    128 experts   256 experts
+prefill ubatch (512)        32            16
+decode npl128 (128)          8             4
+```
+
+`tile/4 = 16` (0014's first auto-select draft default) **equals the 256-expert
+prefill density** and caps prefill: measured -2.0% to -2.9% S_PP on q36-35b-a3b.
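+`density_max = 8` separates decode from prefill for every `n_experts` in `[128, 511]`:
+decode density at npl<=128 is at most 8 (the gate is inclusive, `density <= density_max`)
+and prefill density is at least 9 after the ceiling, so it caps decode and leaves
+prefill on the big tile. This single default change is what makes the patch
+prefill-safe on the 256-expert model.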
+
+## Measurements (default-on vs stock, median of 5 reps)
+
+`llama-batched-bench`, q36-35b-a3b-nvfp4.gguf, `-fa on -npp 128 -ntg 128`, GB10
+sm_121. STOCK = `LLAMA_MOE_AUTO_TILE=0` (exact stock selection); 0015 = default.
+
+```
+  npl   S_TG stock  S_TG 0015   dTG%     S_PP stock  S_PP 0015   dPP%
+    8      183.59     183.18  -0.22%        1489.2     1500.1  +0.73%
+   32      264.02     263.44  -0.22%        2034.5     2033.5  -0.05%
+   64      311.76     310.41  -0.43%        2028.3     2027.6  -0.03%
+  128      336.10     337.32  +0.36%        2025.0     2027.7  +0.13%
+```
+
+Raw npl128 reps: S_TG 0015 `[337.3, 336.9, 336.4, 338.9, 338.1]` vs stock
+`[336.2, 336.1, 335.9, 336.9, 335.8]` (distributions overlap); S_PP 0015
+`[2028.6, 2023.0, 2024.9, 2028.0, 2027.7]` vs stock `[2024.9, 2025.0, 2023.2,
+2029.4, 2029.0]`.
+
+### Honest read: neutral on this model
+
+On q36-35b-a3b the decode delta is **within run-to-run noise** (npl128 +0.36%,
+npl<=64 slightly negative) and prefill is **neutral** (within +/-0.7%, well inside
+the 1% target). The `+5%` decode target from the localmaxxing reference does **not**
+materialize here. q36-35b-a3b decode is bound by the GDN/SSM recurrence and
+256-tiny-expert weight bandwidth, not the MoE col-tile occupancy, so the col-tile
+lever has nothing to bite on.
+
+### npl128 decode tile sweep confirms 64 is the only useful width
+
+`density_max=8` fixed, varying `LLAMA_MOE_DECODE_TILE`, S_TG @ npl128 vs stock:
+
+```
+  TILE8   TILE16  TILE32  TILE64  TILE96
+ -6.31%   -3.18%  -0.17%  +0.70%  -0.76%
+```
+
+Tiles below 64 are **worse**, not better: more column-tiles per expert = more
+grid/scheduling overhead, and the FP4-MMA has a minimum efficient width. TILE96 is
+also slightly negative. So matching the tile to the literal density (4) is
+counterproductive; 64 is the sweet spot, same as 0014.
+
+## Why ship it default-on anyway
+
+1. **Removes 0014's prefill cost by construction.** The cap is density-gated, not
+   global, so prefill keeps its 128 tile (S_PP neutral above).
+2. **Banks the col-tile-bound gain for free.** At npl128 the auto-select picks
+   `tile=64` for a 128-expert model (decode density 8 <= 8), i.e. exactly 0014's
+   `cap64`, so it reproduces 0014's **+4.8% @npl128 on Qwen3-Coder-30B** without the
+   -1.3% prefill cost. (That model was unavailable to re-bench here; the tile choice
+   is identical by construction.)
+3. **Prefill-safe and decode-neutral on the SSM model**, so it is harmless where it
+   does not help.
+4. **Correctness-gated** by the P0 harness (below).
+
+## Conservative by design (known limitation)
+
+A pure-density gate cannot separate two cases with the **same** per-expert density:
+Qwen3-Coder npl256 decode (density 16) and the 256-expert prefill ubatch (density
+16) are identical to the estimator. `density_max=8` therefore **forgoes 0014's
+2.3% gain @npl256** on the 128-expert model to keep 256-expert prefill safe. Recovering
+it needs an `ne12`-aware (absolute token count) gate in addition to density; scoped
+as future work, not implemented.
+
+## Knobs
+
+- `LLAMA_MOE_AUTO_TILE=0` : disable the auto-select, exact stock `mmq_x` selection.
+- `LLAMA_MOE_MMQ_X=<n>` (patch 0014) : **kept** as a manual override; when > 0 it
+  forces the old blunt global cap and bypasses the auto-select (explicit A/B knob).
+- `LLAMA_MOE_DECODE_TILE=<n>` : the small tile (default 64).
+- `LLAMA_MOE_DENSITY_MAX=<n>` : the density ceiling (default 8).
+
+## P0 correctness gate
+
+`tests/test-backend-ops` `test_mul_mat_id` is extended with a ragged small-M
+NVFP4/MXFP4 MoE decode-density block: 128 experts, top-8, m=768, k=2048, n in
+`{16,33,64,128,130,200,256,512}` spanning the cap boundary (n>=130 keeps the 128
+tile at `density_max=8`, n<=128 takes tile 64) and ragged token counts (experts with
+0/1/2 tokens, n not a multiple of the tile). All 16 shapes pass the CUDA-vs-CPU
+oracle on GB10 both default-on and with `LLAMA_MOE_AUTO_TILE=0`; full `MUL_MAT_ID`
+suite 2/2 backends OK. Off the ids path nothing changes (non-MoE `mul_mat`
+byte-identical to stock).
+
+## Verdict
+
+- Correct, prefill-safe, default-on density-aware tile select; the durable design
+  0014's own doc scoped. Supersedes 0014's global cap as the default path; the
+  `LLAMA_MOE_MMQ_X` knob is retained as a manual override.
+- **Net effect on q36-35b-a3b NVFP4: neutral** (decode within noise, prefill neutral)
+  because the model is SSM/bandwidth-bound, not col-tile-bound. The lever's real win
+  lives on col-tile-bound MoE (Qwen3-Coder-30B, +4.8% @npl128), banked here at zero
+  prefill cost.
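The cap boundary described in the P0 correctness gate (n<=128 takes tile 64, n>=130 keeps the 128 tile) follows from the same density estimate applied to the eight test shapes. The sketch below is a hypothetical helper, not part of the test suite: `GateShape` and `classify_p0_shapes` are illustrative names, and the defaults assume the gate's 128 experts, top-8, and `density_max=8`.

```cpp
#include <cstdint>
#include <vector>

struct GateShape {
    int     n;        // token count of the test shape
    int64_t density;  // estimated tokens per active expert
    bool    capped;   // true if the default gate would lower the mmq_x limit to 64
};

// Hypothetical helper: classify the P0 shapes under the documented defaults.
inline std::vector<GateShape> classify_p0_shapes(int n_experts = 128, int n_used = 8,
                                                 int density_max = 8) {
    const int ns[] = {16, 33, 64, 128, 130, 200, 256, 512};
    std::vector<GateShape> out;
    for (int n : ns) {
        const int64_t rows    = int64_t(n) * n_used;                    // token-expert assignments
        const int64_t active  = rows < n_experts ? rows : n_experts;   // bound on active experts
        const int64_t density = (rows + active - 1) / active;          // ceiling division
        out.push_back({n, density, density <= density_max});
    }
    return out;
}
```

Under these assumptions the densities come out as 1, 3, 4, 8, 9, 13, 16, 32, so the first four shapes fall under the gate and the last four stay on the large tile. Note that `capped` only says the limit is lowered; whether the selected `mmq_x` actually differs from stock depends on what stock would have picked for that token count.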