mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-23 16:19:07 -04:00
feat(paged): mirror MoE token-tile density-aware auto-select (patch 0015)
Mirror of llama-paged-dev commit 151343b into the pinned paged patch series.

The durable, default-on follow-up to patch 0014's opt-in LLAMA_MOE_MMQ_X global cap: a host-side density-aware mmq_x auto-select in mul_mat_q_case that caps the MUL_MAT_ID grouped FP4-MMA token-tile only at low per-expert density (decode) and keeps the 128 tile at high density (prefill), so it is prefill-safe by construction (removes 0014's ~1.3% prefill cost). No new kernel.

density_max default = 8 (not tile/4 = 16): 16 equals the 256-expert prefill-ubatch density and regressed S_PP ~2% on Qwen3.6-35B-A3B NVFP4; 8 sits between decode and prefill density for n_experts in [128,511] at n_ubatch=512.

Honest result on the mission's MoE target (Qwen3.6-35B-A3B NVFP4, 256 experts + GDN/SSM linear attention, GB10 sm_121, median of 5 reps): NEUTRAL. Decode S_TG is within run-to-run noise (npl128 +0.36%) and prefill S_PP neutral (within +/-0.7%). This model is bound by the SSM recurrence and 256-tiny-expert weight bandwidth, not the MoE col-tile occupancy, so the col-tile lever has nothing to bite on; a npl128 tile sweep confirms 64 is the only useful width (TILE8 -6.3% ... TILE96 -0.8%). The lever's real win lives on col-tile-bound MoE (Qwen3-Coder-30B, +4.8% @npl128 per patch 0014), which the auto-select reproduces at npl128 by construction at zero prefill cost.

Shipped default-on because it is prefill-safe, decode-neutral here, and correctness-gated. LLAMA_MOE_MMQ_X (0014) kept as a manual override; LLAMA_MOE_AUTO_TILE=0 restores exact stock selection.

P0 gate: test-backend-ops test_mul_mat_id ragged small-M NVFP4/MXFP4 MoE decode-density shapes pass CUDA-vs-CPU on GB10 both default-on and stock.

Full rationale and tables in patches/paged/MOE_DENSITY_AUTO_TILE.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@@ -0,0 +1,238 @@
From 151343bc8c7b956c99eafc855704b70d44637a3b Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Tue, 23 Jun 2026 21:03:00 +0200
Subject: [PATCH] feat(paged): expert-density-aware MoE token-tile auto-select
 (patch 0015)

The durable follow-up to patch 0014's blunt LLAMA_MOE_MMQ_X global cap (which the
0014 doc itself scoped): replace the manual env cap with a host-side, default-on
auto-select inside mul_mat_q_case that picks a small token-tile (mmq_x) for the
MUL_MAT_ID grouped FP4-MMA GEMM only when the per-expert token density is low
(decode), and keeps the large 128-wide tile when density is high (prefill). No new
kernel: the selection only lowers the loop's upper bound to an already-compiled,
granularity- and shared-memory-validated mmq_x.

Density is estimated host-side from the args the ids path already passes:
  ne_get_rows = ncols_dst   = ne12 * n_expert_used   (token-expert assignments)
  n_experts   = nchannels_x = ne02
  density     = ceil(ne_get_rows / min(ne_get_rows, n_experts))   (tokens/expert)
Cap to the small tile (default 64) only when density <= density_max. Unlike 0014's
global cap, the high-density prefill ubatch stays on the big tile, so S_PP does not
regress by construction.

density_max default = 8 (not tile/4 = 16). The cap must fire for decode but not for
a prefill ubatch, and each has per-expert density n_tokens*n_used/n_experts. At the
standard n_ubatch=512, n_used=8: prefill density = 4096/n_experts (32 at 128 experts,
16 at 256), decode at npl<=128 is <= 1024/n_experts (8 at 128, 4 at 256). Default 8
sits strictly between for every n_experts in [128,511], so it caps decode and leaves
prefill on the big tile. tile/4 (=16) equalled the 256-expert prefill density and
cratered its S_PP by ~2%, the regression this threshold exists to avoid.

Measured on GB10 (sm_121), Qwen3.6-35B-A3B NVFP4 (256 experts, top-8, GDN linear
attention), llama-batched-bench -fa on -npp 128 -ntg 128, default-on vs stock
(LLAMA_MOE_AUTO_TILE=0), median of 5 reps:

  npl   S_TG stock   S_TG 0015   dTG%     S_PP stock   S_PP 0015   dPP%
    8     183.59       183.18    -0.22%     1489.2       1500.1    +0.73%
   32     264.02       263.44    -0.22%     2034.5       2033.5    -0.05%
   64     311.76       310.41    -0.43%     2028.3       2027.6    -0.03%
  128     336.10       337.32    +0.36%     2025.0       2027.7    +0.13%

Honest read: on THIS model the decode effect is within run-to-run noise (neutral)
and prefill is neutral. q36-35b-a3b decode is bound by the GDN/SSM recurrence and
256 tiny-expert weight bandwidth, not the MoE col-tile occupancy, so the col-tile
lever (worth +4.8% @npl128 on Qwen3-Coder-30B, 128 larger experts, patch 0014
cap64) does not move it. A npl128 tile sweep on this model confirms 64 is the only
useful width (TILE8 -6.3%, TILE16 -3.2%, TILE32 -0.2%, TILE64 +0.7%, TILE96 -0.8%):
smaller tiles lose to grid/scheduling overhead and the FP4-MMA minimum width.

Value banked default-on: (1) removes 0014's ~1.3% prefill cost by construction
(density-gated, not global); (2) auto-selects the small tile for col-tile-bound MoE
decode, reproducing 0014 cap64's tile=64 at npl128 by construction, so it preserves
the +4.8% on Qwen3-Coder-30B without the prefill cost; (3) prefill-safe and decode-
neutral on the SSM model, harmless where it does not help. Conservative by design:
at npl256 the qwen3coder decode density (16) equals the 256-expert prefill density
(16), indistinguishable to a pure-density gate, so density_max=8 forgoes 0014's
+2.3% @npl256 to keep 256-expert prefill safe; an ne12-aware refinement is future
work.

LLAMA_MOE_MMQ_X (patch 0014) is KEPT as a manual override that, when > 0, forces the
old blunt global cap and bypasses the auto-select (explicit A/B knob). The auto-
select is the default; LLAMA_MOE_AUTO_TILE=0 restores exact stock mmq_x selection.
LLAMA_MOE_DECODE_TILE / LLAMA_MOE_DENSITY_MAX tune the small tile / threshold.

Correctness: extends tests/test-backend-ops test_mul_mat_id with a ragged small-M
NVFP4/MXFP4 MoE decode-density gate (128 experts, top-8, m=768, k=2048, n in
{16,33,64,128,130,200,256,512} spanning the cap boundary and ragged token counts).
All 16 shapes pass CUDA-vs-CPU oracle on GB10 both default-on and with
LLAMA_MOE_AUTO_TILE=0; full MUL_MAT_ID suite 2/2 backends OK. Off the ids path
nothing changes (non-MoE mul_mat byte-identical to stock).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
 ggml/src/ggml-cuda/mmq.cuh | 100 ++++++++++++++++++++++++++++++-------
 tests/test-backend-ops.cpp |  16 ++++++
 2 files changed, 99 insertions(+), 17 deletions(-)

diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh
index cff608e..9718b12 100644
--- a/ggml/src/ggml-cuda/mmq.cuh
+++ b/ggml/src/ggml-cuda/mmq.cuh
@@ -4053,10 +4053,11 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a
     }
 }
 
-// [paged patch 0014] MoE token-tile (mmq_x) cap, read once from env LLAMA_MOE_MMQ_X.
-// Returns 0 when unset / non-positive => disabled (stock mmq_x selection, byte-identical).
-// On the MUL_MAT_ID grouped-GEMM path this caps the per-expert column-tile width toward the
-// low MoE-decode per-expert token density, raising tile fill + occupancy (see mul_mat_q_case).
+// [paged patch 0014] MoE token-tile (mmq_x) MANUAL cap, read once from env LLAMA_MOE_MMQ_X.
+// Returns 0 when unset / non-positive => disabled (fall through to the patch-0015 auto-select).
+// When > 0 it forces a blunt GLOBAL cap on the per-expert column-tile width for the MUL_MAT_ID
+// grouped-GEMM path (decode AND prefill), overriding the density-aware auto-select below. Kept
+// as an explicit override / A-B knob; the default path is now the auto-select.
 static inline int ggml_cuda_moe_mmq_x_cap() {
     static const int cap = []() -> int {
         const char * s = getenv("LLAMA_MOE_MMQ_X");
@@ -4065,6 +4066,43 @@ static inline int ggml_cuda_moe_mmq_x_cap() {
     return cap;
 }
 
+// [paged patch 0015] expert-density-aware MoE token-tile (mmq_x) auto-select knobs (DEFAULT-ON).
+// LLAMA_MOE_AUTO_TILE=0 disables the auto-select => exact stock mmq_x selection.
+static inline bool ggml_cuda_moe_auto_tile_enabled() {
+    static const bool en = []() -> bool {
+        const char * s = getenv("LLAMA_MOE_AUTO_TILE");
+        return !(s && atoi(s) == 0);
+    }();
+    return en;
+}
+// The small high-occupancy token-tile chosen for low-density (decode) MoE matmuls. Default 64:
+// the measured GB10 sweet spot (full per-expert fill with >=4x routing-imbalance headroom).
+static inline int ggml_cuda_moe_decode_tile() {
+    static const int t = []() -> int {
+        const char * s = getenv("LLAMA_MOE_DECODE_TILE");
+        const int v = s ? atoi(s) : 0;
+        return v >= 8 ? v : 64;
+    }();
+    return t;
+}
+// Per-expert token-density ceiling under which the small tile is selected. Default 8: the cap must
+// fire for decode but NOT for a prefill ubatch, and the per-expert density of each is
+// n_tokens*n_used/n_experts. For the standard n_ubatch=512, n_used=8 the prefill density is
+// 4096/n_experts (= 32 at 128 experts, 16 at 256 experts); decode at npl<=128 is <=1024/n_experts
+// (= 8 at 128 experts, 4 at 256). Default 8 sits strictly between the two for every n_experts in
+// [128,511], so it caps decode and leaves the prefill ubatch on the big 128 tile - whereas the old
+// tile/4 (=16) equalled the 256-expert prefill density and cratered its S_PP by ~2% (measured on
+// Qwen3.6-35B-A3B NVFP4). 8 also keeps >=8x fill headroom at tile 64 so an imbalanced expert
+// segment never splits into an extra col-tile.
+static inline int ggml_cuda_moe_density_max() {
+    static const int d = []() -> int {
+        const char * s = getenv("LLAMA_MOE_DENSITY_MAX");
+        const int v = s ? atoi(s) : 0;
+        return v > 0 ? v : 8;
+    }();
+    return d;
+}
+
 template <ggml_type type>
 void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
     const int id = ggml_cuda_get_device();
@@ -4076,25 +4114,53 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
     const int mmq_x_max = get_mmq_x_max_host(cc);
     const int mmq_y = get_mmq_y_host(cc);
 
-    // [paged patch 0014] expert-aware MoE token-tile (mmq_x) cap.
-    // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are
-    // tokens sorted by expert; stock picks mmq_x to cover ncols_max (= ne12, the token count,
-    // up to 128) in a single column-tile. At MoE decode the per-expert token density is low
-    // (top-k of many experts: ~ne12*k/n_experts tokens/expert, e.g. ~8 at npl128 for
-    // Qwen3-30B-A3B top-8/128), so each expert's single mmq_x-wide col-tile is mostly empty:
-    // the MMA accumulator tile is mmq_x-wide at compile time and wastes throughput on the
-    // padding columns while the larger y-tile lowers occupancy. Capping mmq_x toward the
-    // per-expert density raises tile fill + occupancy with no extra weight reads (at
-    // tokens/expert <= mmq_x there is still exactly one non-empty col-tile per expert; the
-    // emptier tiles are skipped by the jt*mmq_x >= col_diff guard in the stream-k kernel).
-    // Default (env unset or <= 0) = disabled => mmq_x selection is byte-identical to stock;
-    // off the ids path the cap never applies.
+    // [paged patch 0015] expert-density-aware MoE token-tile (mmq_x) auto-select (DEFAULT-ON).
+    // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are tokens
+    // sorted by expert; stock picks mmq_x to cover ncols_max (= ne12, the token count, up to 128)
+    // in a single column-tile, i.e. it MAXIMIZES the tile (128 on Blackwell) for the aggregate
+    // batch. But the tile is then applied PER EXPERT, and at MoE decode the per-expert token
+    // density is tiny (top-k of many experts), so each expert's single 128-wide col-tile is mostly
+    // empty: the MMA accumulator tile is mmq_x-wide at compile time and burns throughput on the
+    // padding columns while the larger y-tile lowers occupancy. vLLM's fused-MoE does the opposite
+    // (a small per-expert BLOCK_SIZE_M). We reproduce that here, host-side only, by picking a
+    // SMALLER mmq_x when - and only when - the per-expert density is low:
+    //
+    //   ne_get_rows  = args.ncols_dst   = ne12 * n_expert_used   (total token-expert assignments)
+    //   n_experts    = args.nchannels_x = ne02
+    //   n_active_est = min(n_experts, ne_get_rows)               (upper bound on active experts)
+    //   density      = ceil(ne_get_rows / n_active_est)          (avg tokens per active expert)
+    //
+    // Cap to the small tile (default 64) only when density <= density_max (default 8). 8 sits below
+    // every prefill-ubatch density and above every decode density for n_experts in [128,511] at the
+    // standard n_ubatch=512 (prefill 4096/n_experts, decode <=1024/n_experts), with >=8x fill headroom
+    // so a capped expert segment never splits a col-tile. Decode (per-expert density 4 at 256 experts,
+    // 8 at 128 experts @npl128) gets the fuller high-occupancy tile; the prefill ubatch (density 16 at
+    // 256 / 32 at 128 experts) stays ABOVE the threshold and keeps the big
+    // 128 compute tile - so unlike the blunt global cap (LLAMA_MOE_MMQ_X / patch 0014) this is
+    // prefill-safe by construction. The selection only ever picks an already-compiled, granularity-
+    // and shared-memory-validated mmq_x that the loop below would consider for a smaller batch; no
+    // new kernel. Off the ids path (expert_bounds == nullptr) nothing changes => non-MoE mul_mat
+    // and the gated f16/bf16 host-loop fallback stay byte-identical to stock.
+    //   - LLAMA_MOE_MMQ_X=<n>   : manual blunt global cap, overrides the auto-select (patch 0014).
+    //   - LLAMA_MOE_AUTO_TILE=0 : disable the auto-select (exact stock selection).
+    //   - LLAMA_MOE_DECODE_TILE=<n>, LLAMA_MOE_DENSITY_MAX=<n> : tune the tile / threshold.
     int mmq_x_lim = mmq_x_max;
     if (args.expert_bounds != nullptr) {
         const int moe_cap = ggml_cuda_moe_mmq_x_cap();
         if (moe_cap > 0) {
             const int cap = moe_cap < 8 ? 8 : moe_cap;
             mmq_x_lim = cap < mmq_x_max ? cap : mmq_x_max;
+        } else if (ggml_cuda_moe_auto_tile_enabled()) {
+            const int64_t ne_get_rows = args.ncols_dst;
+            const int64_t n_experts = args.nchannels_x;
+            if (ne_get_rows > 0 && n_experts > 0) {
+                const int64_t n_active = ne_get_rows < n_experts ? ne_get_rows : n_experts;
+                const int64_t density = (ne_get_rows + n_active - 1) / n_active;
+                const int tile = ggml_cuda_moe_decode_tile();
+                if (density <= (int64_t) ggml_cuda_moe_density_max() && tile < mmq_x_max) {
+                    mmq_x_lim = tile;
+                }
+            }
         }
     }
 
diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
index 15ae389..f219309 100644
--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
@@ -8575,6 +8575,22 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
     // gpt-oss issue with Vulkan mmq_id
     test_cases.emplace_back(new test_mul_mat_id(GGML_TYPE_MXFP4, GGML_TYPE_F32, 32, 2, false, 2880, 32, 2880));
 
+    // [paged P0] MXFP4/NVFP4 qwen3-30b-a3b MoE decode-density regression gate for the expert-
+    // density-aware mmq_x auto-select (patch 0015). Real expert-FFN slice (128 experts, top-8,
+    // m=768, k=2048) so this exercises the exact grouped FP4-MMA mmq kernel the model runs.
+    // Per-expert token density = n*n_used/n_mats = n/16; cover the decode band (density 1/4/8/16
+    // at n 16/64/128/256), ragged token counts (n 33/130/200: experts with 0/1/2 tokens, n not a
+    // multiple of the tile) where the tiny-M col-tiles change geometry and any masking can leak,
+    // and a prefill-density shape (n 512 => density 32) the auto-select must leave on the large
+    // 128 tile. n=128 is exactly where stock picks mmq_x=128 and the auto-select picks 64 (n>=130
+    // exceeds density_max=8 and stays on 128), so the
+    // op-test (CPU oracle vs CUDA, deterministic) is the bit-exact regression gate for P1: it must
+    // pass with the auto-select on (default) and with LLAMA_MOE_AUTO_TILE=0 (stock selection).
+    for (ggml_type type_a : {GGML_TYPE_MXFP4, GGML_TYPE_NVFP4}) {
+        for (int n : {16, 33, 64, 128, 130, 200, 256, 512}) {
+            test_cases.emplace_back(new test_mul_mat_id(type_a, GGML_TYPE_F32, 128, 8, false, 768, n, 2048));
+        }
+    }
+
     for (ggml_type type_a : all_types) {
         test_cases.emplace_back(new test_mul_mat_id(type_a, GGML_TYPE_F32, 4, 2, false, 64, 16, 3*ggml_blck_size(type_a)));
     }
--
2.43.0

143
backend/cpp/llama-cpp/patches/paged/MOE_DENSITY_AUTO_TILE.md
Normal file
@@ -0,0 +1,143 @@
# Patch 0015 findings: expert-density-aware MoE token-tile auto-select

The durable follow-up to patch 0014 (`MOE_TOKEN_TILE_CAP.md`): replace the blunt,
opt-in `LLAMA_MOE_MMQ_X` global cap with a host-side, **default-on** density-aware
`mmq_x` auto-select in `mul_mat_q_case`. Companion to
`0015-paged-expert-density-aware-moe-token-tile-auto-select.patch`. Dev tree
`~/llama-paged-dev` (branch `paged`), `build-cuda` sm_121.

Primary model: **Qwen3.6-35B-A3B NVFP4** (`~/bench/q36-35b-a3b-nvfp4.gguf`),
**256 experts, top-8**, expert FFN 512, GDN linear attention (SSM inner 4096),
41 layers. This is a different beast from 0014's Qwen3-Coder-30B-A3B (128 experts,
larger expert FFN, standard attention).

## What it does (vs 0014)

`mul_mat_q_case` picks the token-tile width `mmq_x` to cover `ncols_max` (= `ne12`,
the per-expert column upper bound = token count) in one column-tile, i.e. stock
**maximizes** the tile (128 on Blackwell). Applied per expert at MoE decode, where
per-expert density is tiny, that 128-wide tile is mostly padding.

Patch 0014 capped `mmq_x` globally on the ids path via `LLAMA_MOE_MMQ_X` (decode
**and** prefill), which cost ~1.3% prefill. Patch 0015 instead estimates the
per-expert density host-side, from args the ids path already passes:

```
ne_get_rows = ncols_dst   = ne12 * n_expert_used   (token-expert assignments)
n_experts   = nchannels_x = ne02
density     = ceil(ne_get_rows / min(ne_get_rows, n_experts))   (tokens/expert)
```

and caps to the small tile (default 64) **only when `density <= density_max`**, so
the high-density prefill ubatch keeps the big 128 tile. Prefill-safe by construction.
No new kernel: the selection only lowers the loop's upper bound to an
already-compiled, granularity- and shared-memory-validated `mmq_x`.
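
As a worked illustration of the estimate above, here is a minimal standalone sketch. It is not the patch code: the function names `moe_density` and `picks_small_tile` are hypothetical, and the inputs are plain integers rather than the `mmq_args` fields.

```cpp
#include <cstdint>

// Hypothetical standalone restatement of the host-side density estimate.
// n_tokens plays the role of ne12, n_experts the role of ne02.
constexpr int64_t moe_density(int64_t n_tokens, int64_t n_expert_used, int64_t n_experts) {
    const int64_t ne_get_rows = n_tokens * n_expert_used;  // token-expert assignments
    if (ne_get_rows <= 0 || n_experts <= 0) {
        return 0;  // degenerate input: no estimate
    }
    // Upper bound on the number of experts that can be active at once.
    const int64_t n_active = ne_get_rows < n_experts ? ne_get_rows : n_experts;
    return (ne_get_rows + n_active - 1) / n_active;  // ceiling division
}

// True when the density gate would select the small tile (assumed default threshold 8).
constexpr bool picks_small_tile(int64_t n_tokens, int64_t n_expert_used, int64_t n_experts,
                                int64_t density_max = 8) {
    const int64_t d = moe_density(n_tokens, n_expert_used, n_experts);
    return d > 0 && d <= density_max;
}

// Decode at npl128 on a 256-expert top-8 model: density 4, small tile.
static_assert(moe_density(128, 8, 256) == 4, "decode density");
// A 512-token prefill ubatch on the same model: density 16, big tile kept.
static_assert(!picks_small_tile(512, 8, 256), "prefill stays on the big tile");
```

With 128 experts and top-8, 128 tokens give density 8 (capped) while 130 tokens give density 9 (not capped), which is the boundary the P0 test shapes straddle.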

## The threshold matters: `density_max = 8`, not `tile/4 = 16`

The cap must fire for decode but not for a prefill ubatch. Each has per-expert
density `n_tokens * n_used / n_experts`. At the standard `n_ubatch=512`, `n_used=8`:

```
                        128 experts   256 experts
prefill ubatch (512)         32            16
decode npl128  (128)          8             4
```

`tile/4 = 16` (0014's first auto-select draft default) **equals the 256-expert
prefill density** and caps prefill: measured -2.0% to -2.9% S_PP on q36-35b-a3b.
`density_max = 8` is at or above the npl<=128 decode density and strictly below the
prefill density for every `n_experts` in `[128, 511]` (at 128 experts the decode
density is exactly 8, which the `<=` comparison still caps), so it caps decode and
leaves prefill on the big tile. This single default change is what makes the patch
prefill-safe on the 256-expert model.
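
The `[128, 511]` range can be checked mechanically. The sketch below is a hedged, self-contained check under the same assumptions as the table (`n_ubatch=512`, `n_used=8`, decode at npl 128, and more token-expert assignments than experts so the estimator divides by `n_experts`); `ceil_div`, `separates`, and `separates_range` are illustrative names, not functions from the patch.

```cpp
#include <cstdint>

constexpr int64_t ceil_div(int64_t a, int64_t b) { return (a + b - 1) / b; }

// True when density_max caps decode (npl_max tokens) but not a full prefill ubatch.
// Assumes n_tokens * n_used > n_experts, so the estimator's min() resolves to n_experts.
constexpr bool separates(int64_t n_experts, int64_t density_max,
                         int64_t n_ubatch = 512, int64_t npl_max = 128, int64_t n_used = 8) {
    const int64_t prefill = ceil_div(n_ubatch * n_used, n_experts);
    const int64_t decode  = ceil_div(npl_max * n_used, n_experts);
    return decode <= density_max && prefill > density_max;
}

// Checks every expert count in [lo, hi].
constexpr bool separates_range(int64_t lo, int64_t hi, int64_t density_max) {
    for (int64_t e = lo; e <= hi; ++e) {
        if (!separates(e, density_max)) {
            return false;
        }
    }
    return true;
}

static_assert(separates_range(128, 511, 8), "8 works for every n_experts in [128, 511]");
static_assert(!separates(127, 8), "below 128 experts the npl128 decode density exceeds 8");
static_assert(!separates(512, 8), "at 512 experts the prefill density drops to 8 and is capped");
static_assert(!separates(256, 16), "tile/4 = 16 caps the 256-expert prefill ubatch");
```

The same arithmetic shows the collision a pure-density gate cannot resolve: `ceil_div(256 * 8, 128)` (128-expert decode at npl256) and `ceil_div(512 * 8, 256)` (256-expert prefill) are both 16.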

## Measurements (default-on vs stock, median of 5 reps)

`llama-batched-bench`, q36-35b-a3b-nvfp4.gguf, `-fa on -npp 128 -ntg 128`, GB10
sm_121. STOCK = `LLAMA_MOE_AUTO_TILE=0` (exact stock selection); 0015 = default.

```
npl   S_TG stock   S_TG 0015   dTG%     S_PP stock   S_PP 0015   dPP%
  8     183.59       183.18    -0.22%     1489.2       1500.1    +0.73%
 32     264.02       263.44    -0.22%     2034.5       2033.5    -0.05%
 64     311.76       310.41    -0.43%     2028.3       2027.6    -0.03%
128     336.10       337.32    +0.36%     2025.0       2027.7    +0.13%
```

Raw npl128 reps: S_TG 0015 `[337.3, 336.9, 336.4, 338.9, 338.1]` vs stock
`[336.2, 336.1, 335.9, 336.9, 335.8]` (distributions overlap); S_PP 0015
`[2028.6, 2023.0, 2024.9, 2028.0, 2027.7]` vs stock `[2024.9, 2025.0, 2023.2,
2029.4, 2029.0]`.
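
The npl128 row of the table can be reproduced from these raw reps. The following is a small sketch of the median-of-5 and percent-delta arithmetic; the helper names `median5` and `delta_pct` are made up for illustration, and the inputs are the rounded one-decimal values quoted above, so the result matches the table only to its printed precision.

```cpp
// Median of exactly five samples via insertion sort on a local copy.
constexpr double median5(const double (&v)[5]) {
    double s[5] = {v[0], v[1], v[2], v[3], v[4]};
    for (int i = 1; i < 5; ++i) {
        const double key = s[i];
        int j = i - 1;
        while (j >= 0 && s[j] > key) {
            s[j + 1] = s[j];
            --j;
        }
        s[j + 1] = key;
    }
    return s[2];
}

// Relative change of the patched median against the stock median, in percent.
constexpr double delta_pct(double stock, double patched) {
    return (patched - stock) / stock * 100.0;
}

// Raw npl128 reps as quoted in this document.
constexpr double tg_0015[5]  = {337.3, 336.9, 336.4, 338.9, 338.1};
constexpr double tg_stock[5] = {336.2, 336.1, 335.9, 336.9, 335.8};
constexpr double pp_0015[5]  = {2028.6, 2023.0, 2024.9, 2028.0, 2027.7};
constexpr double pp_stock[5] = {2024.9, 2025.0, 2023.2, 2029.4, 2029.0};

constexpr double d_tg = delta_pct(median5(tg_stock), median5(tg_0015));  // about +0.36
constexpr double d_pp = delta_pct(median5(pp_stock), median5(pp_0015));  // about +0.13
```

The lowest 0015 decode rep (336.4) sits below the highest stock rep (336.9), which is the overlap the text refers to.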

### Honest read: neutral on this model

On q36-35b-a3b the decode delta is **within run-to-run noise** (npl128 +0.36%,
npl<=64 slightly negative) and prefill is **neutral** (within +/-0.7%, well inside
the 1% target). The `+5%` decode target from the localmaxxing reference does **not**
materialize here. q36-35b-a3b decode is bound by the GDN/SSM recurrence and
256-tiny-expert weight bandwidth, not the MoE col-tile occupancy, so the col-tile
lever has nothing to bite on.

### npl128 decode tile sweep confirms 64 is the only useful width

`density_max=8` fixed, varying `LLAMA_MOE_DECODE_TILE`, S_TG @ npl128 vs stock:

```
TILE8    TILE16   TILE32   TILE64   TILE96
-6.31%   -3.18%   -0.17%   +0.70%   -0.76%
```

Smaller tiles are **worse**, not better: more column-tiles per expert = more
grid/scheduling overhead, and the FP4-MMA has a minimum efficient width. So matching
the tile to the literal density (4) is counterproductive; 64 is the sweet spot,
same as 0014.

## Why ship it default-on anyway

1. **Removes 0014's prefill cost by construction.** The cap is density-gated, not
   global, so prefill keeps its 128 tile (S_PP neutral above).
2. **Banks the col-tile-bound gain for free.** At npl128 the auto-select picks
   `tile=64` for a 128-expert model (decode density 8 <= 8), i.e. exactly 0014's
   `cap64`, so it reproduces 0014's **+4.8% @npl128 on Qwen3-Coder-30B** without the
   -1.3% prefill cost. (That model was unavailable to re-bench here; the tile choice
   is identical by construction.)
3. **Prefill-safe and decode-neutral on the SSM model**, so it is harmless where it
   does not help.
4. **Correctness-gated** by the P0 harness (below).

## Conservative by design (known limitation)

A pure-density gate cannot separate two cases with the **same** per-expert density:
Qwen3-Coder npl256 decode (density 16) and the 256-expert prefill ubatch (density
16) are identical to the estimator. `density_max=8` therefore **forgoes 0014's
+2.3% @npl256** on the 128-expert model to keep 256-expert prefill safe. Recovering
it needs an `ne12`-aware (absolute token count) gate in addition to density; scoped
as future work, not implemented.

## Knobs

- `LLAMA_MOE_AUTO_TILE=0` : disable the auto-select, exact stock `mmq_x` selection.
- `LLAMA_MOE_MMQ_X=<n>` (patch 0014) : **kept** as a manual override; when > 0 it
  forces the old blunt global cap and bypasses the auto-select (explicit A/B knob).
- `LLAMA_MOE_DECODE_TILE=<n>` : the small tile (default 64).
- `LLAMA_MOE_DENSITY_MAX=<n>` : the density ceiling (default 8).
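
How the knobs interact can be summarized as a pure function. This is a sketch of the precedence only, assuming the environment values have already been parsed; `MoeTileKnobs` and `resolve_mmq_x_limit` are hypothetical names, and the per-expert density is passed in rather than derived from `mmq_args`.

```cpp
// Hypothetical container for the already-parsed knob values (defaults as listed above).
struct MoeTileKnobs {
    int  manual_cap  = 0;     // LLAMA_MOE_MMQ_X; <= 0 means unset
    bool auto_tile   = true;  // false only when LLAMA_MOE_AUTO_TILE parses to 0
    int  decode_tile = 64;    // LLAMA_MOE_DECODE_TILE
    int  density_max = 8;     // LLAMA_MOE_DENSITY_MAX
};

// Upper bound handed to the mmq_x selection loop.
// density <= 0 stands in for "no estimate available" and leaves the stock limit.
constexpr int resolve_mmq_x_limit(const MoeTileKnobs & k, bool ids_path,
                                  long long density, int mmq_x_max) {
    if (!ids_path) {
        return mmq_x_max;  // non-MoE mul_mat: unchanged
    }
    if (k.manual_cap > 0) {
        // Manual override wins: clamp to [8, mmq_x_max], applied to decode and prefill alike.
        const int cap = k.manual_cap < 8 ? 8 : k.manual_cap;
        return cap < mmq_x_max ? cap : mmq_x_max;
    }
    if (k.auto_tile && density > 0 && density <= k.density_max && k.decode_tile < mmq_x_max) {
        return k.decode_tile;  // low-density (decode) case: small tile
    }
    return mmq_x_max;  // high density or auto-select disabled: stock limit
}

// Defaults: decode density 4 gets tile 64, prefill density 16 keeps 128.
static_assert(resolve_mmq_x_limit(MoeTileKnobs{}, true, 4, 128) == 64, "decode");
static_assert(resolve_mmq_x_limit(MoeTileKnobs{}, true, 16, 128) == 128, "prefill");
```

The manual cap deliberately ignores density, which is why it remains useful as an A/B knob against the auto-select.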

## P0 correctness gate

`tests/test-backend-ops` `test_mul_mat_id` is extended with a ragged small-M
NVFP4/MXFP4 MoE decode-density block: 128 experts, top-8, m=768, k=2048, n in
`{16,33,64,128,130,200,256,512}` spanning the cap boundary (n>=130 keeps the 128
tile at `density_max=8`, n<=128 takes tile 64) and ragged token counts (experts with
0/1/2 tokens, n not a multiple of the tile). All 16 shapes pass the CUDA-vs-CPU
oracle on GB10 both default-on and with `LLAMA_MOE_AUTO_TILE=0`; full `MUL_MAT_ID`
suite 2/2 backends OK. Off the ids path nothing changes (non-MoE `mul_mat`
byte-identical to stock).

## Verdict

- Correct, prefill-safe, default-on density-aware tile select; the durable design
  0014's own doc scoped. Supersedes 0014's global cap as the default path; the
  `LLAMA_MOE_MMQ_X` knob is retained as a manual override.
- **Net effect on q36-35b-a3b NVFP4: neutral** (decode within noise, prefill neutral)
  because the model is SSM/bandwidth-bound, not col-tile-bound. The lever's real win
  lives on col-tile-bound MoE (Qwen3-Coder-30B, +4.8% @npl128), banked here at zero
  prefill cost.