mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-27 01:47:18 -04:00
feat(paged): mirror patch 0014 - expert-aware MoE token-tile cap
Mirror of the dev-tree engine patch (ggml mmq.cuh) into the paged patch set, plus its measurement writeup. Adds LLAMA_MOE_MMQ_X, an opt-in env cap on the MoE grouped-GEMM token-tile (mmq_x) for the MUL_MAT_ID path; default-off = byte-identical to stock.

Honest result of the MoE near-term lever: the npl128 decode cliff does NOT exist on current HEAD (stock decode is monotonic 85/282/629/935/1295/1779 t/s at npl 1/8/32/64/128/256; the old cliff was fixed upstream by the sorted grouped FP4-MMA GEMM + MoE stream-k). The cap is therefore not a cliff fix but a modest high-batch decode micro-optimization: cap64 gives +4.8% decode at npl128 and +2.3% at npl256 (reproducible, neutral at npl<=64) for a ~1.3% prefill cost; cap16/cap32 are net-negative (prefill -41% / -17%). Full tables in MOE_TOKEN_TILE_CAP.md; durable density-aware follow-up in MOE_GROUPED_GEMM_SCOPE.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@@ -0,0 +1,140 @@
From 652b858252b354f4d4fb49e5ed7468eeee8e32fc Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Tue, 23 Jun 2026 15:47:06 +0200
Subject: [PATCH] feat(paged): expert-aware MoE token-tile cap (patch 0014)

On GB10 (sm_121) the Qwen3-30B-A3B-class mxfp4 MoE decode path already uses the
sorted grouped FP4-MMA GEMM (MUL_MAT_ID -> ggml_cuda_mul_mat_q ids branch:
mm_ids_helper moe_align/scatter + one persistent stream-k mul_mat_q), so the
originally reported npl128 throughput cliff does NOT reproduce on this build.
llama-batched-bench decode (S_TG t/s) is monotonic across batch:

  npl     1     8    32    64   128   256
  S_TG   85   282   629   935  1295  1779   (stock, mxfp4 MoE, -fa on)

There is no knee to erase; the old cliff (a real high-batch regression, 620 t/s
at npl128) was fixed upstream by grouped-mmq + MoE stream-k load balancing.

What remains is a pure tile-shape micro-inefficiency. In mul_mat_q_case the
token-tile width mmq_x is chosen to cover ncols_max (= ne12, the per-expert
column upper bound = token count, up to 128) in one column-tile. At MoE decode
the per-expert token density is ~ne12*k/n_experts (top-8 of 128 => ~1/16 of
ne12, e.g. ~8 tokens/expert at npl128), so each expert's single mmq_x-wide
col-tile is only ~6% filled: the MMA accumulator tile is mmq_x-wide at compile
time and burns throughput on the padding columns while the larger y-tile lowers
occupancy. Stock picks the LARGEST tile (128) where the SMALLEST tile that still
covers the density would raise fill + occupancy at no extra weight read (at
tokens/expert <= mmq_x there is exactly one non-empty col-tile per expert; the
emptier tiles are skipped by the jt*mmq_x >= col_diff guard in the stream-k
kernel) - the inverse of vLLM's small per-expert BLOCK_SIZE_M.

Add LLAMA_MOE_MMQ_X: an env cap on mmq_x for the MUL_MAT_ID path only
(expert_bounds != nullptr). Default (unset or <= 0) = disabled, so the mmq_x
selection, and therefore every kernel launched, is byte-identical to stock. The
cap only ever lowers the loop's upper bound and still selects from the same
granularity- and shared-memory-validated mmq_x set stock already uses for
smaller batches, so no new kernel configuration is exercised.

Measured on GB10, qwen3coder-mxfp4.gguf, -fa on, -npp 128 -ntg 128, same binary,
only LLAMA_MOE_MMQ_X differs (decode S_TG t/s / prefill S_PP t/s):

  npl   stock S_TG   cap64 S_TG    d%    stock S_PP   cap64 S_PP
   64          936          938   +0.1         2924         2883
  128         1295         1357   +4.8         3075         3038
  256         1784         1825   +2.3         3085         3046

(reproduced across interleaved reps; cap64 npl128 = 1357.5/1357.0, very stable)

cap64 lifts high-batch decode +4.8% (npl128) / +2.3% (npl256), neutral at
npl <= 64, for a consistent ~1.3% prefill cost. Smaller caps are net-negative:
cap16 / cap32 crater prefill -41% / -17% (a 512-token prefill ubatch has ~32
tokens/expert, which overflows a 16/32-wide tile into extra col-tiles + weight
re-reads), so 64 is the recommended value and the only one that helps net.

Honest framing: this is NOT a cliff fix (no cliff exists) and not a real-server
throughput unlock (llama-server continuous batching already scales). It is a
modest high-effective-batch DECODE micro-optimization that matches vLLM's
smaller per-expert M-tiling, surfaced as an opt-in, default-off knob. The
durable density-aware auto-select (drop the blunt global cap, choose mmq_x from
ne_get_rows / n_active_experts so prefill keeps its large tile) is scoped in
patches/paged/MOE_GROUPED_GEMM_SCOPE.md.

Correctness: greedy temp-0 llama-server output with cap64 is byte-identical to
stock for single-stream generation (fibonacci / capital-of-France / photosynthesis
prompts) and stays coherent; batched-bench ran thousands of capped MoE matmuls at
npl128/256 (mmq_x forced 128 -> 64) with no CUDA error / NaN and stable output.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
 ggml/src/ggml-cuda/mmq.cuh | 37 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh
index edf546d..cff608e 100644
--- a/ggml/src/ggml-cuda/mmq.cuh
+++ b/ggml/src/ggml-cuda/mmq.cuh
@@ -6,6 +6,7 @@
 
 #include <climits>
 #include <cstdint>
+#include <cstdlib>
 
 using namespace ggml_cuda_mma;
 
@@ -4052,6 +4053,18 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a
     }
 }
 
+// [paged patch 0014] MoE token-tile (mmq_x) cap, read once from env LLAMA_MOE_MMQ_X.
+// Returns 0 when unset / non-positive => disabled (stock mmq_x selection, byte-identical).
+// On the MUL_MAT_ID grouped-GEMM path this caps the per-expert column-tile width toward the
+// low MoE-decode per-expert token density, raising tile fill + occupancy (see mul_mat_q_case).
+static inline int ggml_cuda_moe_mmq_x_cap() {
+    static const int cap = []() -> int {
+        const char * s = getenv("LLAMA_MOE_MMQ_X");
+        return s ? atoi(s) : 0;
+    }();
+    return cap;
+}
+
 template <ggml_type type>
 void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
     const int id = ggml_cuda_get_device();
@@ -4063,10 +4076,32 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
     const int mmq_x_max = get_mmq_x_max_host(cc);
     const int mmq_y = get_mmq_y_host(cc);
 
+    // [paged patch 0014] expert-aware MoE token-tile (mmq_x) cap.
+    // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are
+    // tokens sorted by expert; stock picks mmq_x to cover ncols_max (= ne12, the token count,
+    // up to 128) in a single column-tile. At MoE decode the per-expert token density is low
+    // (top-k of many experts: ~ne12*k/n_experts tokens/expert, e.g. ~8 at npl128 for
+    // Qwen3-30B-A3B top-8/128), so each expert's single mmq_x-wide col-tile is mostly empty:
+    // the MMA accumulator tile is mmq_x-wide at compile time and wastes throughput on the
+    // padding columns while the larger y-tile lowers occupancy. Capping mmq_x toward the
+    // per-expert density raises tile fill + occupancy with no extra weight reads (at
+    // tokens/expert <= mmq_x there is still exactly one non-empty col-tile per expert; the
+    // emptier tiles are skipped by the jt*mmq_x >= col_diff guard in the stream-k kernel).
+    // Default (env unset or <= 0) = disabled => mmq_x selection is byte-identical to stock;
+    // off the ids path the cap never applies.
+    int mmq_x_lim = mmq_x_max;
+    if (args.expert_bounds != nullptr) {
+        const int moe_cap = ggml_cuda_moe_mmq_x_cap();
+        if (moe_cap > 0) {
+            const int cap = moe_cap < 8 ? 8 : moe_cap;
+            mmq_x_lim = cap < mmq_x_max ? cap : mmq_x_max;
+        }
+    }
+
     int mmq_x_best = 0;
     int ntiles_x_best = INT_MAX;
 
-    for (int mmq_x = 8; mmq_x <= mmq_x_max && ntiles_x_best > 1; mmq_x += 8) {
+    for (int mmq_x = 8; mmq_x <= mmq_x_lim && ntiles_x_best > 1; mmq_x += 8) {
         const int granularity = mmq_get_granularity_host(mmq_x, cc);
 
         if (mmq_x % granularity != 0 || mmq_get_nbytes_shared<type>(mmq_x, mmq_y, cc, warp_size, nwarps) > smpbo) {
--
2.43.0

99
backend/cpp/llama-cpp/patches/paged/MOE_TOKEN_TILE_CAP.md
Normal file
@@ -0,0 +1,99 @@
# Patch 0014 findings: expert-aware MoE token-tile cap (LLAMA_MOE_MMQ_X)

Near-term lever for the MoE-vs-vLLM workflow on GB10 (sm_121). Companion to
`0014-paged-expert-aware-moe-token-tile-cap.patch`. Model:
Qwen3-Coder-30B-A3B, 128 experts, top-8, mxfp4 experts
(`~/bench/qwen3coder-mxfp4.gguf`). Dev tree `~/llama-paged-dev` (branch `paged`),
`build-cuda` sm_121.

## Headline (honest): there is no npl128 cliff to erase on this build

The mission premise was a 25% decode drop at npl128 (batched-bench 253/505/830/620
@ npl 8/32/64/128). It does **not** reproduce. Stock decode is monotonic:

```
llama-batched-bench, qwen3coder-mxfp4.gguf, -fa on, -npp 128 -ntg 128, S_TG t/s
npl       1     8    32    64   128   256
stock    85   282   629   935  1295  1779   <- monotonic, no knee
```

The old cliff was a real high-batch regression since fixed upstream: mxfp4 MoE
decode on GB10 already takes the sorted grouped FP4-MMA GEMM (MUL_MAT_ID ->
`ggml_cuda_mul_mat_q` ids branch: `mm_ids_helper` moe_align/scatter + one
persistent stream-k `mul_mat_q`), i.e. vLLM's algorithm. See
`MOE_GROUPED_GEMM_SCOPE.md`.

## What the knob does

`mul_mat_q_case` picks the token-tile width `mmq_x` to cover `ncols_max`
(= `ne12`, the per-expert column upper bound = token count, up to 128) in one
column-tile. At MoE decode the per-expert density is `~ne12*k/n_experts`
(top-8/128 => ~1/16 of `ne12`), so each expert's `mmq_x`-wide col-tile is only
~6% filled: the MMA accumulator tile is `mmq_x`-wide at compile time and wastes
throughput on the padding columns, and the larger y-tile lowers occupancy.
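
The ~6% figure can be checked with a few lines of arithmetic. This is a minimal sketch that assumes perfectly uniform routing (real routing is skewed, so individual experts will sit above or below the average); `tokens_per_expert` and `tile_fill` are hypothetical helper names, not functions in ggml.

```cpp
#include <cassert>

// Expected tokens routed to each expert when n_tokens tokens each select
// top_k of n_experts experts. Assumption: uniform routing.
double tokens_per_expert(int n_tokens, int top_k, int n_experts) {
    assert(n_experts > 0);
    return static_cast<double>(n_tokens) * top_k / n_experts;
}

// Fraction of an mmq_x-wide column tile occupied by real tokens (capped at 1).
double tile_fill(double tokens, int mmq_x) {
    assert(mmq_x > 0);
    const double fill = tokens / mmq_x;
    return fill > 1.0 ? 1.0 : fill;
}
```

With `ne12 = 128`, top-8 of 128 experts, `tokens_per_expert(128, 8, 128)` is 8, so `tile_fill(8.0, 128)` is 0.0625 (the ~6% above) and `tile_fill(8.0, 64)` is 0.125.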

`LLAMA_MOE_MMQ_X=<n>` caps `mmq_x` on the MUL_MAT_ID path only
(`expert_bounds != nullptr`). It only lowers the selection-loop upper bound and
still chooses from the same granularity/shared-memory-validated `mmq_x` set stock
already uses for smaller batches - no new kernel configuration. Default
(unset/<=0) = disabled => byte-identical to stock.
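
The effect of the cap on the selection loop can be sketched host-side as below. The clamp in `mmq_x_limit` follows the patch; `pick_mmq_x` is a simplified stand-in that omits the granularity and shared-memory checks, and its tile-count update rule is an assumption, since the patch hunk ends before that part of the loop body. Both function names are hypothetical.

```cpp
#include <cassert>
#include <climits>

// Clamp rule from the patch: a cap <= 0, or a call off the MUL_MAT_ID path,
// leaves the stock limit untouched; otherwise the cap is floored at 8 and
// never exceeds mmq_x_max.
int mmq_x_limit(int mmq_x_max, int moe_cap, bool is_moe_ids_path) {
    if (!is_moe_ids_path || moe_cap <= 0) {
        return mmq_x_max;
    }
    const int cap = moe_cap < 8 ? 8 : moe_cap;
    return cap < mmq_x_max ? cap : mmq_x_max;
}

// Simplified selection: walk mmq_x in steps of 8 and keep the first width that
// lowers the column-tile count, stopping once a single tile covers ncols_max.
// Assumption: every candidate passes the granularity / shared-memory checks.
int pick_mmq_x(int ncols_max, int mmq_x_lim) {
    assert(ncols_max > 0 && mmq_x_lim >= 8);
    int mmq_x_best = 0;
    int ntiles_x_best = INT_MAX;
    for (int mmq_x = 8; mmq_x <= mmq_x_lim && ntiles_x_best > 1; mmq_x += 8) {
        const int ntiles_x = (ncols_max + mmq_x - 1) / mmq_x;
        if (ntiles_x < ntiles_x_best) {
            mmq_x_best = mmq_x;
            ntiles_x_best = ntiles_x;
        }
    }
    return mmq_x_best;
}
```

Under these assumptions, `pick_mmq_x(128, mmq_x_limit(128, 0, true))` returns 128 (stock), while `pick_mmq_x(128, mmq_x_limit(128, 64, true))` returns 64, which is the `128 -> 64` change the patch describes at npl128.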

## Measurements (same binary, only LLAMA_MOE_MMQ_X differs)

Decode throughput, S_TG t/s:

```
npl   stock   cap16   cap32   cap64
  1      85      85      85      85
  8     282     280     282     282
 32     629     623     629     628
 64     935     915     949     934
128    1295    1204    1344    1357   <- cap64 +4.8% (cap16 -7%)
256    1779    1370    1723    1820   <- cap64 +2.3% (cap16 -23%)
```

Prefill throughput, S_PP t/s (the cost):

```
npl   stock   cap16   cap32   cap64
128    3083    1817    2559    3038
256    3084    1818    2560    3046
               -41%    -17%   -1.3%
```

Reproducibility (interleaved off/cap64, two reps each):

```
npl   off rep1/rep2    cap64 rep1/rep2
128   1300 / 1290      1357.5 / 1357.0
256   1786 / 1782      1826.3 / 1824.5
```

cap64 is stable to <0.1% and the gain sits well above the ~1% run-to-run band.
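
The percentage annotations in the decode table follow directly from the raw throughput columns. A minimal sketch of that arithmetic, with `pct_delta` and `round1` as hypothetical helper names:

```cpp
#include <cassert>
#include <cmath>

// Relative change of a capped run against the stock run, in percent.
double pct_delta(double stock, double capped) {
    assert(stock > 0.0);
    return (capped - stock) / stock * 100.0;
}

// Round to one decimal place, matching the precision used in the tables.
double round1(double x) {
    return std::round(x * 10.0) / 10.0;
}
```

For example, `round1(pct_delta(1295, 1357))` gives 4.8 and `round1(pct_delta(1779, 1820))` gives 2.3, matching the cap64 annotations at npl128 and npl256.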

## Why 64 is the only value that helps net

A 512-token prefill ubatch routes ~32 tokens/expert. cap16/cap32 force those into
16/32-wide tiles, overflowing into extra col-tiles + weight re-reads -> prefill
craters (-41% / -17%). cap64 still holds the prefill density in one tile (32 < 64)
so prefill is near-neutral (-1.3%), while decode (~8 tokens/expert at npl128) gets
the fuller, higher-occupancy tile.
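
The overflow argument comes down to a ceiling division per expert. The sketch below counts non-empty column tiles for a set of per-expert token counts; the counts in the usage note are made-up values averaging 32, chosen only to illustrate the spill, and `col_tiles` / `total_col_tiles` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Non-empty column tiles one expert needs for `tokens` columns at width mmq_x.
int col_tiles(int tokens, int mmq_x) {
    assert(tokens >= 0 && mmq_x > 0);
    return (tokens + mmq_x - 1) / mmq_x;
}

// Sum over experts. Every tile beyond the first for an expert implies another
// pass over that expert's weights.
int total_col_tiles(const std::vector<int> & tokens_per_expert, int mmq_x) {
    int total = 0;
    for (int tokens : tokens_per_expert) {
        total += col_tiles(tokens, mmq_x);
    }
    return total;
}
```

For the illustrative counts `{24, 32, 40, 32}`, the totals are 9 tiles at width 16, 5 at width 32 (the 40-token expert spills into a second tile), and 4 at width 64, one tile per expert. At the decode density of ~8 tokens/expert, `col_tiles(8, 64)` is already 1.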

## Verdict

- Real but **modest** high-effective-batch DECODE micro-optimization
  (+4.8% npl128, +2.3% npl256), neutral at npl<=64, ~1.3% prefill cost at cap64.
- **Not** a cliff fix (no cliff) and **not** a real-server unlock (llama-server
  continuous batching already scales). Shipped as an opt-in, default-off knob;
  recommended value 64 for decode-heavy high-concurrency deployments.
- Correctness: greedy temp-0 server output with cap64 is byte-identical to stock
  for single-stream generation and stays coherent; thousands of capped MoE
  matmuls at npl128/256 ran with no CUDA error / NaN.

## Durable follow-up (scoped, not implemented)

Replace the blunt global cap with a density-aware auto-select: choose `mmq_x`
from `ne_get_rows / n_active_experts` inside `mul_mat_q_case` so decode gets the
small tile while prefill keeps its large tile automatically (removes the ~1.3%
prefill cost). The follow-up also covers the block-padded `moe_align` in
`mm_ids_helper`. See
`MOE_GROUPED_GEMM_SCOPE.md`.
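
One possible shape for such an auto-select is sketched below. This is not the scoped design: the function name, the `headroom` multiplier, and the rounding rule are all assumptions made for illustration. The measurements above show cap16 losing at decode even at ~8 tokens/expert, so a plain average density would likely pick too small a tile; the `headroom` factor stands in for whatever tuning the real heuristic would need.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical density-aware pick: scale the mean tokens per active expert by
// a tuning factor, round up to a multiple of 8, then clamp to [8, mmq_x_max].
// Assumption: mmq_x_max is itself a multiple of 8.
int density_aware_mmq_x(int ne_get_rows, int n_active_experts, int mmq_x_max, double headroom) {
    assert(ne_get_rows >= 0 && n_active_experts > 0);
    assert(mmq_x_max >= 8 && headroom > 0.0);
    const double target = headroom * ne_get_rows / n_active_experts;
    int mmq_x = static_cast<int>(std::ceil(target / 8.0)) * 8;
    if (mmq_x < 8) {
        mmq_x = 8;
    }
    if (mmq_x > mmq_x_max) {
        mmq_x = mmq_x_max;
    }
    return mmq_x;
}
```

With an illustrative `headroom` of 8, a decode batch of 128 tokens (1024 routed rows over 128 experts) maps to 64, and a 512-token prefill ubatch (4096 routed rows) clamps to 128, which reproduces the cap64 decode choice while leaving prefill on its large tile. The value 8 is picked only to match those two points, not derived from the source.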