From acb22a66ed0e5cc58e918062bcb2d45a3c965734 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Tue, 23 Jun 2026 19:04:55 +0000
Subject: [PATCH] feat(paged): mirror MoE token-tile density-aware auto-select
 (patch 0015)

Mirror of llama-paged-dev commit 151343b into the pinned paged patch series.
The durable, default-on follow-up to patch 0014's opt-in LLAMA_MOE_MMQ_X
global cap: a host-side density-aware mmq_x auto-select in mul_mat_q_case
that caps the MUL_MAT_ID grouped FP4-MMA token-tile only at low per-expert
density (decode) and keeps the 128 tile at high density (prefill), so it is
prefill-safe by construction (removes 0014's ~1.3% prefill cost). No new
kernel.

density_max default = 8 (not tile/4 = 16): 16 equals the 256-expert
prefill-ubatch density and regressed S_PP ~2% on Qwen3.6-35B-A3B NVFP4; 8
sits between decode and prefill density for n_experts in [128,511] at
n_ubatch=512.

Honest result on the mission's MoE target (Qwen3.6-35B-A3B NVFP4, 256
experts + GDN/SSM linear attention, GB10 sm_121, median of 5 reps): NEUTRAL.
Decode S_TG is within run-to-run noise (npl128 +0.36%) and prefill S_PP
neutral (within +/-0.7%). This model is bound by the SSM recurrence and
256-tiny-expert weight bandwidth, not the MoE col-tile occupancy, so the
col-tile lever has nothing to bite on; a npl128 tile sweep confirms 64 is the
only useful width (TILE8 -6.3% ... TILE96 -0.8%). The lever's real win lives
on col-tile-bound MoE (Qwen3-Coder-30B, +4.8% @npl128 per patch 0014), which
the auto-select reproduces at npl128 by construction at zero prefill cost.

Shipped default-on because it is prefill-safe, decode-neutral here, and
correctness-gated. LLAMA_MOE_MMQ_X (0014) kept as a manual override;
LLAMA_MOE_AUTO_TILE=0 restores exact stock selection.

P0 gate: test-backend-ops test_mul_mat_id ragged small-M NVFP4/MXFP4 MoE
decode-density shapes pass CUDA-vs-CPU on GB10 both default-on and stock.

Full rationale and tables in patches/paged/MOE_DENSITY_AUTO_TILE.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto
---
 ...ity-aware-moe-token-tile-auto-select.patch | 238 ++++++++++++++++++
 .../patches/paged/MOE_DENSITY_AUTO_TILE.md    | 143 +++++++++++
 2 files changed, 381 insertions(+)
 create mode 100644 backend/cpp/llama-cpp/patches/paged/0015-paged-expert-density-aware-moe-token-tile-auto-select.patch
 create mode 100644 backend/cpp/llama-cpp/patches/paged/MOE_DENSITY_AUTO_TILE.md

diff --git a/backend/cpp/llama-cpp/patches/paged/0015-paged-expert-density-aware-moe-token-tile-auto-select.patch b/backend/cpp/llama-cpp/patches/paged/0015-paged-expert-density-aware-moe-token-tile-auto-select.patch
new file mode 100644
index 000000000..81dfd8d5f
--- /dev/null
+++ b/backend/cpp/llama-cpp/patches/paged/0015-paged-expert-density-aware-moe-token-tile-auto-select.patch
@@ -0,0 +1,238 @@
+From 151343bc8c7b956c99eafc855704b70d44637a3b Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto
+Date: Tue, 23 Jun 2026 21:03:00 +0200
+Subject: [PATCH] feat(paged): expert-density-aware MoE token-tile auto-select
+ (patch 0015)
+
+The durable follow-up to patch 0014's blunt LLAMA_MOE_MMQ_X global cap (which the
+0014 doc itself scoped): replace the manual env cap with a host-side, default-on
+auto-select inside mul_mat_q_case that picks a small token-tile (mmq_x) for the
+MUL_MAT_ID grouped FP4-MMA GEMM only when the per-expert token density is low
+(decode), and keeps the large 128-wide tile when density is high (prefill). No new
+kernel: the selection only lowers the loop's upper bound to an already-compiled,
+granularity- and shared-memory-validated mmq_x.
+
+Density is estimated host-side from the args the ids path already passes:
+  ne_get_rows = ncols_dst   = ne12 * n_expert_used                 (token-expert assignments)
+  n_experts   = nchannels_x = ne02
+  density     = ceil(ne_get_rows / min(ne_get_rows, n_experts))    (tokens/expert)
+Cap to the small tile (default 64) only when density <= density_max.
+Unlike 0014's global cap, the high-density prefill ubatch stays on the big tile, so
+S_PP does not regress by construction.
+
+density_max default = 8 (not tile/4 = 16). The cap must fire for decode but not for
+a prefill ubatch, and each has per-expert density n_tokens*n_used/n_experts. At the
+standard n_ubatch=512, n_used=8: prefill density = 4096/n_experts (32 at 128 experts,
+16 at 256), decode at npl<=128 is <= 1024/n_experts (8 at 128, 4 at 256). Default 8
+sits strictly between for every n_experts in [128,511], so it caps decode and leaves
+prefill on the big tile. tile/4 (=16) equalled the 256-expert prefill density and
+cratered its S_PP by ~2%, the regression this threshold exists to avoid.
+
+Measured on GB10 (sm_121), Qwen3.6-35B-A3B NVFP4 (256 experts, top-8, GDN linear
+attention), llama-batched-bench -fa on -npp 128 -ntg 128, default-on vs stock
+(LLAMA_MOE_AUTO_TILE=0), median of 5 reps:
+
+  npl   S_TG stock   S_TG 0015     dTG%   S_PP stock   S_PP 0015     dPP%
+    8       183.59      183.18   -0.22%       1489.2      1500.1   +0.73%
+   32       264.02      263.44   -0.22%       2034.5      2033.5   -0.05%
+   64       311.76      310.41   -0.43%       2028.3      2027.6   -0.03%
+  128       336.10      337.32   +0.36%       2025.0      2027.7   +0.13%
+
+Honest read: on THIS model the decode effect is within run-to-run noise (neutral)
+and prefill is neutral. q36-35b-a3b decode is bound by the GDN/SSM recurrence and
+256 tiny-expert weight bandwidth, not the MoE col-tile occupancy, so the col-tile
+lever (worth +4.8% @npl128 on Qwen3-Coder-30B, 128 larger experts, patch 0014
+cap64) does not move it. A npl128 tile sweep on this model confirms 64 is the only
+useful width (TILE8 -6.3%, TILE16 -3.2%, TILE32 -0.2%, TILE64 +0.7%, TILE96 -0.8%):
+smaller tiles lose to grid/scheduling overhead and the FP4-MMA minimum width.
+
+Value banked default-on: (1) removes 0014's ~1.3% prefill cost by construction
+(density-gated, not global); (2) auto-selects the small tile for col-tile-bound MoE
+decode, reproducing 0014 cap64's tile=64 at npl128 by construction, so it preserves
+the +4.8% on Qwen3-Coder-30B without the prefill cost; (3) prefill-safe and decode-
+neutral on the SSM model, harmless where it does not help. Conservative by design:
+at npl256 the qwen3coder decode density (16) equals the 256-expert prefill density
+(16), indistinguishable to a pure-density gate, so density_max=8 forgoes 0014's
++2.3% @npl256 to keep 256-expert prefill safe; an ne12-aware refinement is future
+work.
+
+LLAMA_MOE_MMQ_X (patch 0014) is KEPT as a manual override that, when > 0, forces the
+old blunt global cap and bypasses the auto-select (explicit A/B knob). The auto-
+select is the default; LLAMA_MOE_AUTO_TILE=0 restores exact stock mmq_x selection.
+LLAMA_MOE_DECODE_TILE / LLAMA_MOE_DENSITY_MAX tune the small tile / threshold.
+
+Correctness: extends tests/test-backend-ops test_mul_mat_id with a ragged small-M
+NVFP4/MXFP4 MoE decode-density gate (128 experts, top-8, m=768, k=2048, n in
+{16,33,64,128,130,200,256,512} spanning the cap boundary and ragged token counts).
+All 16 shapes pass CUDA-vs-CPU oracle on GB10 both default-on and with
+LLAMA_MOE_AUTO_TILE=0; full MUL_MAT_ID suite 2/2 backends OK. Off the ids path
+nothing changes (non-MoE mul_mat byte-identical to stock).
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto
+---
+ ggml/src/ggml-cuda/mmq.cuh | 100 ++++++++++++++++++++++++++++++-------
+ tests/test-backend-ops.cpp |  16 ++++++
+ 2 files changed, 99 insertions(+), 17 deletions(-)
+
+diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh
+index cff608e..9718b12 100644
+--- a/ggml/src/ggml-cuda/mmq.cuh
++++ b/ggml/src/ggml-cuda/mmq.cuh
+@@ -4053,10 +4053,11 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a
+     }
+ }
+ 
+-// [paged patch 0014] MoE token-tile (mmq_x) cap, read once from env LLAMA_MOE_MMQ_X.
+-// Returns 0 when unset / non-positive => disabled (stock mmq_x selection, byte-identical).
+-// On the MUL_MAT_ID grouped-GEMM path this caps the per-expert column-tile width toward the
+-// low MoE-decode per-expert token density, raising tile fill + occupancy (see mul_mat_q_case).
++// [paged patch 0014] MoE token-tile (mmq_x) MANUAL cap, read once from env LLAMA_MOE_MMQ_X.
++// Returns 0 when unset / non-positive => disabled (fall through to the patch-0015 auto-select).
++// When > 0 it forces a blunt GLOBAL cap on the per-expert column-tile width for the MUL_MAT_ID
++// grouped-GEMM path (decode AND prefill), overriding the density-aware auto-select below. Kept
++// as an explicit override / A-B knob; the default path is now the auto-select.
+ static inline int ggml_cuda_moe_mmq_x_cap() {
+     static const int cap = []() -> int {
+         const char * s = getenv("LLAMA_MOE_MMQ_X");
+@@ -4065,6 +4066,43 @@ static inline int ggml_cuda_moe_mmq_x_cap() {
+     return cap;
+ }
+ 
++// [paged patch 0015] expert-density-aware MoE token-tile (mmq_x) auto-select knobs (DEFAULT-ON).
++// LLAMA_MOE_AUTO_TILE=0 disables the auto-select => exact stock mmq_x selection.
++static inline bool ggml_cuda_moe_auto_tile_enabled() {
++    static const bool en = []() -> bool {
++        const char * s = getenv("LLAMA_MOE_AUTO_TILE");
++        return !(s && atoi(s) == 0);
++    }();
++    return en;
++}
++// The small high-occupancy token-tile chosen for low-density (decode) MoE matmuls. Default 64:
++// the measured GB10 sweet spot (full per-expert fill with >=4x routing-imbalance headroom).
++static inline int ggml_cuda_moe_decode_tile() {
++    static const int t = []() -> int {
++        const char * s = getenv("LLAMA_MOE_DECODE_TILE");
++        const int v = s ? atoi(s) : 0;
++        return v >= 8 ? v : 64;
++    }();
++    return t;
++}
++// Per-expert token-density ceiling under which the small tile is selected. Default 8: the cap must
++// fire for decode but NOT for a prefill ubatch, and the per-expert density of each is
++// n_tokens*n_used/n_experts. For the standard n_ubatch=512, n_used=8 the prefill density is
++// 4096/n_experts (= 32 at 128 experts, 16 at 256 experts); decode at npl<=128 is <=1024/n_experts
++// (= 8 at 128 experts, 4 at 256). Default 8 sits strictly between the two for every n_experts in
++// [128,511], so it caps decode and leaves the prefill ubatch on the big 128 tile - whereas the old
++// tile/4 (=16) equalled the 256-expert prefill density and cratered its S_PP by ~2% (measured on
++// Qwen3.6-35B-A3B NVFP4). 8 also keeps >=8x fill headroom at tile 64 so an imbalanced expert
++// segment never splits into an extra col-tile.
++static inline int ggml_cuda_moe_density_max() {
++    static const int d = []() -> int {
++        const char * s = getenv("LLAMA_MOE_DENSITY_MAX");
++        const int v = s ? atoi(s) : 0;
++        return v > 0 ?
++            v : 8;
++    }();
++    return d;
++}
++
+ template <ggml_type type>
+ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
+     const int id = ggml_cuda_get_device();
+@@ -4076,25 +4114,53 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
+     const int mmq_x_max = get_mmq_x_max_host(cc);
+     const int mmq_y = get_mmq_y_host(cc);
+ 
+-    // [paged patch 0014] expert-aware MoE token-tile (mmq_x) cap.
+-    // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are
+-    // tokens sorted by expert; stock picks mmq_x to cover ncols_max (= ne12, the token count,
+-    // up to 128) in a single column-tile. At MoE decode the per-expert token density is low
+-    // (top-k of many experts: ~ne12*k/n_experts tokens/expert, e.g. ~8 at npl128 for
+-    // Qwen3-30B-A3B top-8/128), so each expert's single mmq_x-wide col-tile is mostly empty:
+-    // the MMA accumulator tile is mmq_x-wide at compile time and wastes throughput on the
+-    // padding columns while the larger y-tile lowers occupancy. Capping mmq_x toward the
+-    // per-expert density raises tile fill + occupancy with no extra weight reads (at
+-    // tokens/expert <= mmq_x there is still exactly one non-empty col-tile per expert; the
+-    // emptier tiles are skipped by the jt*mmq_x >= col_diff guard in the stream-k kernel).
+-    // Default (env unset or <= 0) = disabled => mmq_x selection is byte-identical to stock;
+-    // off the ids path the cap never applies.
++    // [paged patch 0015] expert-density-aware MoE token-tile (mmq_x) auto-select (DEFAULT-ON).
++    // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are tokens
++    // sorted by expert; stock picks mmq_x to cover ncols_max (= ne12, the token count, up to 128)
++    // in a single column-tile, i.e. it MAXIMIZES the tile (128 on Blackwell) for the aggregate
++    // batch.
++    // But the tile is then applied PER EXPERT, and at MoE decode the per-expert token
++    // density is tiny (top-k of many experts), so each expert's single 128-wide col-tile is mostly
++    // empty: the MMA accumulator tile is mmq_x-wide at compile time and burns throughput on the
++    // padding columns while the larger y-tile lowers occupancy. vLLM's fused-MoE does the opposite
++    // (a small per-expert BLOCK_SIZE_M). We reproduce that here, host-side only, by picking a
++    // SMALLER mmq_x when - and only when - the per-expert density is low:
++    //
++    //   ne_get_rows  = args.ncols_dst   = ne12 * n_expert_used   (total token-expert assignments)
++    //   n_experts    = args.nchannels_x = ne02
++    //   n_active_est = min(n_experts, ne_get_rows)               (upper bound on active experts)
++    //   density      = ceil(ne_get_rows / n_active_est)          (avg tokens per active expert)
++    //
++    // Cap to the small tile (default 64) only when density <= density_max (default 8). 8 sits below
++    // every prefill-ubatch density and above every decode density for n_experts in [128,511] at the
++    // standard n_ubatch=512 (prefill 4096/n_experts, decode <=1024/n_experts), with >=8x fill headroom
++    // so a capped expert segment never splits a col-tile. Decode (per-expert density 4 at 256 experts,
++    // 8 at 128 experts @npl128) gets the fuller high-occupancy tile; the prefill ubatch (density 16 at
++    // 256 / 32 at 128 experts) stays ABOVE the threshold and keeps the big
++    // 128 compute tile - so unlike the blunt global cap (LLAMA_MOE_MMQ_X / patch 0014) this is
++    // prefill-safe by construction. The selection only ever picks an already-compiled, granularity-
++    // and shared-memory-validated mmq_x that the loop below would consider for a smaller batch; no
++    // new kernel. Off the ids path (expert_bounds == nullptr) nothing changes => non-MoE mul_mat
++    // and the gated f16/bf16 host-loop fallback stay byte-identical to stock.
++    //   - LLAMA_MOE_MMQ_X= : manual blunt global cap, overrides the auto-select (patch 0014).
++    //   - LLAMA_MOE_AUTO_TILE=0 : disable the auto-select (exact stock selection).
++    //   - LLAMA_MOE_DECODE_TILE=, LLAMA_MOE_DENSITY_MAX= : tune the tile / threshold.
+     int mmq_x_lim = mmq_x_max;
+     if (args.expert_bounds != nullptr) {
+         const int moe_cap = ggml_cuda_moe_mmq_x_cap();
+         if (moe_cap > 0) {
+             const int cap = moe_cap < 8 ? 8 : moe_cap;
+             mmq_x_lim = cap < mmq_x_max ? cap : mmq_x_max;
++        } else if (ggml_cuda_moe_auto_tile_enabled()) {
++            const int64_t ne_get_rows = args.ncols_dst;
++            const int64_t n_experts = args.nchannels_x;
++            if (ne_get_rows > 0 && n_experts > 0) {
++                const int64_t n_active = ne_get_rows < n_experts ? ne_get_rows : n_experts;
++                const int64_t density = (ne_get_rows + n_active - 1) / n_active;
++                const int tile = ggml_cuda_moe_decode_tile();
++                if (density <= (int64_t) ggml_cuda_moe_density_max() && tile < mmq_x_max) {
++                    mmq_x_lim = tile;
++                }
++            }
+         }
+     }
+ 
+diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
+index 15ae389..f219309 100644
+--- a/tests/test-backend-ops.cpp
++++ b/tests/test-backend-ops.cpp
+@@ -8575,6 +8575,22 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
+     // gpt-oss issue with Vulkan mmq_id
+     test_cases.emplace_back(new test_mul_mat_id(GGML_TYPE_MXFP4, GGML_TYPE_F32, 32, 2, false, 2880, 32, 2880));
+ 
++    // [paged P0] MXFP4/NVFP4 qwen3-30b-a3b MoE decode-density regression gate for the expert-
++    // density-aware mmq_x auto-select (patch 0015). Real expert-FFN slice (128 experts, top-8,
++    // m=768, k=2048) so this exercises the exact grouped FP4-MMA mmq kernel the model runs.
++    // Per-expert token density = n*n_used/n_mats = n/16; cover the decode band (density 1/4/8/16
++    // at n 16/64/128/256), ragged token counts (n 33/130/200: experts with 0/1/2 tokens, n not a
++    // multiple of the tile) where the tiny-M col-tiles change geometry and any masking can leak,
++    // and a prefill-density shape (n 512 => density 32) the auto-select must leave on the large
++    // 128 tile.
++    // At the default density_max=8, n=128 is exactly where stock picks mmq_x=128 and the
++    // auto-select picks 64 (n>=130 has density >= 9 and keeps the 128 tile), so the
++    // op-test (CPU oracle vs CUDA, deterministic) is the bit-exact regression gate for P1: it must
++    // pass with the auto-select on (default) and with LLAMA_MOE_AUTO_TILE=0 (stock selection).
++    for (ggml_type type_a : {GGML_TYPE_MXFP4, GGML_TYPE_NVFP4}) {
++        for (int n : {16, 33, 64, 128, 130, 200, 256, 512}) {
++            test_cases.emplace_back(new test_mul_mat_id(type_a, GGML_TYPE_F32, 128, 8, false, 768, n, 2048));
++        }
++    }
++
+     for (ggml_type type_a : all_types) {
+         test_cases.emplace_back(new test_mul_mat_id(type_a, GGML_TYPE_F32, 4, 2, false, 64, 16, 3*ggml_blck_size(type_a)));
+     }
+-- 
+2.43.0
+
diff --git a/backend/cpp/llama-cpp/patches/paged/MOE_DENSITY_AUTO_TILE.md b/backend/cpp/llama-cpp/patches/paged/MOE_DENSITY_AUTO_TILE.md
new file mode 100644
index 000000000..546498923
--- /dev/null
+++ b/backend/cpp/llama-cpp/patches/paged/MOE_DENSITY_AUTO_TILE.md
@@ -0,0 +1,143 @@
+# Patch 0015 findings: expert-density-aware MoE token-tile auto-select
+
+The durable follow-up to patch 0014 (`MOE_TOKEN_TILE_CAP.md`): replace the blunt,
+opt-in `LLAMA_MOE_MMQ_X` global cap with a host-side, **default-on** density-aware
+`mmq_x` auto-select in `mul_mat_q_case`. Companion to
+`0015-paged-expert-density-aware-moe-token-tile-auto-select.patch`. Dev tree
+`~/llama-paged-dev` (branch `paged`), `build-cuda` sm_121.
+
+Primary model: **Qwen3.6-35B-A3B NVFP4** (`~/bench/q36-35b-a3b-nvfp4.gguf`),
+**256 experts, top-8**, expert FFN 512, GDN linear attention (SSM inner 4096),
+41 layers. This is a different beast from 0014's Qwen3-Coder-30B-A3B (128 experts,
+larger expert FFN, standard attention).
+
+## What it does (vs 0014)
+
+`mul_mat_q_case` picks the token-tile width `mmq_x` to cover `ncols_max` (= `ne12`,
+the per-expert column upper bound = token count) in one column-tile, i.e. stock
+**maximizes** the tile (128 on Blackwell).
+Applied per expert at MoE decode, where per-expert density is tiny, that 128-wide
+tile is mostly padding.
+
+Patch 0014 capped `mmq_x` globally on the ids path via `LLAMA_MOE_MMQ_X` (decode
+**and** prefill), which cost ~1.3% prefill. Patch 0015 instead estimates the
+per-expert density host-side, from args the ids path already passes:
+
+```
+ne_get_rows = ncols_dst   = ne12 * n_expert_used                 (token-expert assignments)
+n_experts   = nchannels_x = ne02
+density     = ceil(ne_get_rows / min(ne_get_rows, n_experts))    (tokens/expert)
+```
+
+and caps to the small tile (default 64) **only when `density <= density_max`**, so
+the high-density prefill ubatch keeps the big 128 tile. Prefill-safe by construction.
+No new kernel: the selection only lowers the loop's upper bound to an
+already-compiled, granularity- and shared-memory-validated `mmq_x`.
+
+## The threshold matters: `density_max = 8`, not `tile/4 = 16`
+
+The cap must fire for decode but not for a prefill ubatch. Each has per-expert
+density `n_tokens * n_used / n_experts`. At the standard `n_ubatch=512`, `n_used=8`:
+
+```
+                        128 experts   256 experts
+prefill ubatch (512)             32            16
+decode npl128  (128)              8             4
+```
+
+`tile/4 = 16` (0014's first auto-select draft default) **equals the 256-expert
+prefill density** and caps prefill: measured -2.0% to -2.9% S_PP on q36-35b-a3b.
+`density_max = 8` sits strictly between decode and prefill for every `n_experts` in
+`[128, 511]`, so it caps decode and leaves prefill on the big tile. This single
+default change is what makes the patch prefill-safe on the 256-expert model.
+
+## Measurements (default-on vs stock, median of 5 reps)
+
+`llama-batched-bench`, q36-35b-a3b-nvfp4.gguf, `-fa on -npp 128 -ntg 128`, GB10
+sm_121. STOCK = `LLAMA_MOE_AUTO_TILE=0` (exact stock selection); 0015 = default.
+
+```
+  npl   S_TG stock   S_TG 0015     dTG%   S_PP stock   S_PP 0015     dPP%
+    8       183.59      183.18   -0.22%       1489.2      1500.1   +0.73%
+   32       264.02      263.44   -0.22%       2034.5      2033.5   -0.05%
+   64       311.76      310.41   -0.43%       2028.3      2027.6   -0.03%
+  128       336.10      337.32   +0.36%       2025.0      2027.7   +0.13%
+```
+
+Raw npl128 reps: S_TG 0015 `[337.3, 336.9, 336.4, 338.9, 338.1]` vs stock
+`[336.2, 336.1, 335.9, 336.9, 335.8]` (distributions overlap); S_PP 0015
+`[2028.6, 2023.0, 2024.9, 2028.0, 2027.7]` vs stock `[2024.9, 2025.0, 2023.2,
+2029.4, 2029.0]`.
+
+### Honest read: neutral on this model
+
+On q36-35b-a3b the decode delta is **within run-to-run noise** (npl128 +0.36%,
+npl<=64 slightly negative) and prefill is **neutral** (within +/-0.7%, well inside
+the 1% target). The `+5%` decode target from the localmaxxing reference does **not**
+materialize here. q36-35b-a3b decode is bound by the GDN/SSM recurrence and
+256-tiny-expert weight bandwidth, not the MoE col-tile occupancy, so the col-tile
+lever has nothing to bite on.
+
+### npl128 decode tile sweep confirms 64 is the only useful width
+
+`density_max=8` fixed, varying `LLAMA_MOE_DECODE_TILE`, S_TG @ npl128 vs stock:
+
+```
+   TILE8   TILE16   TILE32   TILE64   TILE96
+  -6.31%   -3.18%   -0.17%   +0.70%   -0.76%
+```
+
+Smaller tiles are **worse**, not better: more column-tiles per expert = more
+grid/scheduling overhead, and the FP4-MMA has a minimum efficient width. So matching
+the tile to the literal density (4) is counterproductive; 64 is the sweet spot,
+same as 0014.
+
+## Why ship it default-on anyway
+
+1. **Removes 0014's prefill cost by construction.** The cap is density-gated, not
+   global, so prefill keeps its 128 tile (S_PP neutral above).
+2. **Banks the col-tile-bound gain for free.** At npl128 the auto-select picks
+   `tile=64` for a 128-expert model (decode density 8 <= 8), i.e. exactly 0014's
+   `cap64`, so it reproduces 0014's **+4.8% @npl128 on Qwen3-Coder-30B** without the
+   -1.3% prefill cost.
+   (That model was unavailable to re-bench here; the tile choice
+   is identical by construction.)
+3. **Prefill-safe and decode-neutral on the SSM model**, so it is harmless where it
+   does not help.
+4. **Correctness-gated** by the P0 harness (below).
+
+## Conservative by design (known limitation)
+
+A pure-density gate cannot separate two cases with the **same** per-expert density:
+Qwen3-Coder npl256 decode (density 16) and the 256-expert prefill ubatch (density
+16) are identical to the estimator. `density_max=8` therefore **forgoes 0014's
++2.3% @npl256** on the 128-expert model to keep 256-expert prefill safe. Recovering
+it needs an `ne12`-aware (absolute token count) gate in addition to density; scoped
+as future work, not implemented.
+
+## Knobs
+
+- `LLAMA_MOE_AUTO_TILE=0` : disable the auto-select, exact stock `mmq_x` selection.
+- `LLAMA_MOE_MMQ_X=` (patch 0014) : **kept** as a manual override; when > 0 it
+  forces the old blunt global cap and bypasses the auto-select (explicit A/B knob).
+- `LLAMA_MOE_DECODE_TILE=` : the small tile (default 64).
+- `LLAMA_MOE_DENSITY_MAX=` : the density ceiling (default 8).
+
+## P0 correctness gate
+
+`tests/test-backend-ops` `test_mul_mat_id` is extended with a ragged small-M
+NVFP4/MXFP4 MoE decode-density block: 128 experts, top-8, m=768, k=2048, n in
+`{16,33,64,128,130,200,256,512}` spanning the cap boundary (n>=130 keeps the 128
+tile at `density_max=8`, n<=128 takes tile 64) and ragged token counts (experts with
+0/1/2 tokens, n not a multiple of the tile). All 16 shapes pass the CUDA-vs-CPU
+oracle on GB10 both default-on and with `LLAMA_MOE_AUTO_TILE=0`; full `MUL_MAT_ID`
+suite 2/2 backends OK. Off the ids path nothing changes (non-MoE `mul_mat`
+byte-identical to stock).
+
+## Verdict
+
+- Correct, prefill-safe, default-on density-aware tile select; the durable design
+  0014's own doc scoped. Supersedes 0014's global cap as the default path; the
+  `LLAMA_MOE_MMQ_X` knob is retained as a manual override.
+- **Net effect on q36-35b-a3b NVFP4: neutral** (decode within noise, prefill neutral)
+  because the model is SSM/bandwidth-bound, not col-tile-bound. The lever's real win
+  lives on col-tile-bound MoE (Qwen3-Coder-30B, +4.8% @npl128), banked here at zero
+  prefill cost.
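
The density gate this patch describes can be checked on the host without a GPU. The sketch below is a minimal stand-alone model of it, not the code in `mmq.cuh`. The function names `moe_density` and `moe_mmq_x_lim` are hypothetical. It assumes the defaults stated above (decode tile 64, `density_max` 8). It leaves out the `LLAMA_MOE_MMQ_X` override, the environment lookups, and the later granularity and shared-memory validation of `mmq_x`. The `static_assert` lines restate the density table from the findings doc.

```cpp
#include <cstdint>

// Hypothetical model of the per-expert density estimate:
// ceil(ne_get_rows / min(ne_get_rows, n_experts)), or 0 for degenerate input.
constexpr int64_t moe_density(int64_t n_tokens, int64_t n_used, int64_t n_experts) {
    const int64_t ne_get_rows = n_tokens * n_used;  // token-expert assignments
    if (ne_get_rows <= 0 || n_experts <= 0) {
        return 0;
    }
    const int64_t n_active = ne_get_rows < n_experts ? ne_get_rows : n_experts;
    return (ne_get_rows + n_active - 1) / n_active;
}

// Hypothetical model of the upper bound handed to the mmq_x selection loop:
// the small decode tile at low density, otherwise the stock maximum.
constexpr int moe_mmq_x_lim(int64_t n_tokens, int64_t n_used, int64_t n_experts,
                            int mmq_x_max, int decode_tile = 64, int density_max = 8) {
    const int64_t density = moe_density(n_tokens, n_used, n_experts);
    const bool low_density = density > 0 && density <= density_max;
    return (low_density && decode_tile < mmq_x_max) ? decode_tile : mmq_x_max;
}

// Density table from the findings doc (n_used = 8).
static_assert(moe_density(512, 8, 128) == 32, "prefill ubatch, 128 experts");
static_assert(moe_density(512, 8, 256) == 16, "prefill ubatch, 256 experts");
static_assert(moe_density(128, 8, 128) == 8,  "decode npl128, 128 experts");
static_assert(moe_density(128, 8, 256) == 4,  "decode npl128, 256 experts");

// Decode is capped to 64, prefill keeps the 128 tile.
static_assert(moe_mmq_x_lim(128, 8, 256, 128) == 64,  "decode is capped");
static_assert(moe_mmq_x_lim(512, 8, 256, 128) == 128, "prefill keeps the big tile");

// Known limitation: npl256 decode on 128 experts has density 16, so it is not capped.
static_assert(moe_mmq_x_lim(256, 8, 128, 128) == 128, "npl256 decode is not capped");
```

With 128 experts and top-8, `moe_mmq_x_lim(128, 8, 128, 128)` yields 64 and `moe_mmq_x_lim(130, 8, 128, 128)` yields 128. That is the cap boundary the P0 shapes `n=128` and `n=130` straddle.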