From 8925c009b75ee7f37914810cff438948a402e7e4 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Tue, 23 Jun 2026 13:17:03 +0000
Subject: [PATCH] docs(paged): scope durable grouped FP4-MMA MoE GEMM port for
 GB10

Build-ready plan (not implemented) for matching/beating vLLM MoE
grouped-GEMM efficiency on GB10 sm_121 for Qwen3-30B-A3B mxfp4.

Honest reframe: the grouped GEMM the mission scoped to build already
exists upstream and runs on GB10 for mxfp4 - should_use_mmq() routes
MUL_MAT_ID to the grouped mmq path, which already contains both vLLM
building blocks (mm_ids_helper moe_align/scatter + a persistent stream-k
FP4-MMA grouped GEMM). The npl128 cliff was a since-fixed regression, not
a batched-bench artifact; re-measured decode is monotonic 85->1771 t/s.

The one structural gap is M-tile sizing: ggml maximizes mmq_x over the
aggregate token count while vLLM uses a small per-expert BLOCK_SIZE_M, so
each tiny per-expert M-tile is 3-6% filled at decode density. Scope is a
surgical two-step delta (expert-aware mmq_x selection; block-padded
moe_align), the parity gate (test_mul_mat_id bit-exact + ragged small-M),
and a phased plan gated behind the GB10 W4A16 occupancy wall.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
 .../patches/paged/MOE_GROUPED_GEMM_SCOPE.md   | 220 ++++++++++++++++++
 1 file changed, 220 insertions(+)
 create mode 100644 backend/cpp/llama-cpp/patches/paged/MOE_GROUPED_GEMM_SCOPE.md

diff --git a/backend/cpp/llama-cpp/patches/paged/MOE_GROUPED_GEMM_SCOPE.md b/backend/cpp/llama-cpp/patches/paged/MOE_GROUPED_GEMM_SCOPE.md
new file mode 100644
index 000000000..f5f26fe61
--- /dev/null
+++ b/backend/cpp/llama-cpp/patches/paged/MOE_GROUPED_GEMM_SCOPE.md
@@ -0,0 +1,220 @@
+# Durable scope: grouped FP4-MMA MoE GEMM for ggml CUDA on GB10 (sm_121)
+
+Build-ready plan. **Not implemented in this workflow** (large kernel work). This
+document scopes the durable path to match or beat vLLM MoE grouped-GEMM efficiency
+on GB10 for the Qwen3-30B-A3B-class mxfp4 MoE, and records the single honest
+finding that re-shapes the whole effort.
+
+Hardware: NVIDIA GB10 (sm_121, CC=1210 = `GGML_CUDA_CC_DGX_SPARK`), unified
+LPDDR5X ~273 GB/s. Model: Qwen3-Coder-30B-A3B, 128 experts, top-8, mxfp4 experts
+(`~/bench/qwen3coder-mxfp4.gguf`). Dev tree `~/llama-paged-dev` (branch `paged`,
+HEAD at patch 0013), `build-cuda` sm_121.
+
+## TL;DR (the honest reframe)
+
+**The grouped GEMM the mission scoped to build from scratch already exists in
+upstream ggml, and it already runs on GB10 for mxfp4.** For mxfp4 experts on
+sm_121 `ggml_cuda_should_use_mmq()` returns true (`turing_mma_available`), so
+MUL_MAT_ID takes the **grouped mmq path**, which already contains both vLLM
+building blocks:
+
+1. a moe_align / token-sort-by-expert (`mmid.cu` `mm_ids_helper`:
+   count -> warp-scan/cumsum -> scatter into expert-sorted contiguous buffers),
+2. a **single persistent stream-k grouped FP4-MMA GEMM** (one `mul_mat_q` launch;
+   grid flattened into kbc-continuous space over expert x col-tile x row-tile x
+   k-block; native FP4 MMA via `block_fp4_mmq` under `BLACKWELL_MMA_AVAILABLE`).
+
+The per-expert host-side row-gather loop in `ggml-cuda.cu`
+`ggml_cuda_mul_mat_id()` (~L2632-2790) - the path the mission's root-cause
+analysis describes as "the cliff" - is a **fallback only reached when
+`should_use_mmq()==false`** (f16/bf16 experts, non-Blackwell). It is **never the
+GB10 mxfp4 path.**
+
+Consequence: the "npl128 MoE cliff" does not exist on the current dev HEAD.
+Re-measured batched-bench decode (`S_TG` t/s) on the mxfp4 MoE rises monotonically
+`85 / 278 / 637 / 950 / 1306 / 1771` at npl `1 / 8 / 32 / 64 / 128 / 256`. The
+original `253/505/830/620` cliff was a real high-batch regression that has since
+been **fixed upstream** (FP4-native grouped mmq + MoE stream-k balancing), not a
+batched-bench artifact.
+
+**Therefore the durable work is NOT "port moe_align + a grouped GEMM."** It is a
+**surgical fix to the one place ggml diverges from vLLM: the M-tile (token-tile)
+sizing heuristic.** This document scopes that delta, plus the optional
+block-padded align, plus the parity gate and phased plan. It also records what is
+intentionally NOT built and why (the W4A16 occupancy wall).
+
+## The one structural gap: M-tile sizing
+
+`mul_mat_q_case` / `launch_mul_mat_q` pick `mmq_x` (the token/M tile) by
+**minimizing** `ntiles_x = ceil(ncols_max / mmq_x)` over the **aggregate** token
+count (`ncols_max = ne12`). On Blackwell `get_mmq_x_max = 128`, so the heuristic
+always selects the **largest** `mmq_x` that fits shared memory. vLLM's
+CUTLASS/Triton fused_moe does the **opposite**: a small tuned `BLOCK_SIZE_M`
+(typ. 16/32/64), padded **per expert**.
+
+ggml then applies its over-large `mmq_x` **per expert**. In MoE decode the tokens
+per expert is tiny - Qwen3-30B-A3B top-8 of 128: at npl64 ~512 assignments over
+~126 activated experts ~= 4 tok/expert; at npl128 ~1024 over ~128 ~= 8 tok/expert.
+So each expert's single M-tile of width 128 is **3-6% filled** -> ragged tiny-M
+tiles run a dense-GEMM-tuned config, wasting MMA M-throughput, and (with
+`need_check`) every expert runs as a masked partial tail.
+
+The FP4 MMA N-fragment (`tile_C::J`) is 8, so the **ideal M-tile ~= tokens/expert
+(~8)**, 16x smaller than the 128 ggml picks. This mismatch is the durable gap.
+
+Critically for GB10: at tokens/expert <= 8 there is exactly **one col-tile per
+expert**, so a smaller `mmq_x` causes **no extra weight re-read** (weight rows are
+re-read only across multiple col-tiles, of which there is one) while it **lowers
+shared-mem footprint and raises occupancy** - strictly aligned with the GB10
+occupancy lessons.
+
+## What already exists (reuse, do NOT rebuild)
+
+Engine files on DGX `~/llama-paged-dev/ggml/src/ggml-cuda/`:
+
+- **[A] moe_align / scatter** = `mmid.cu` `mm_ids_helper`. One CUDA block per
+  expert (`gridDim.x = n_experts`); warp counts tokens routed to this expert,
+  warp-scan for the compaction index, scatters into `ids_src1` (column gather
+  permutation, expert-sorted contiguous), `ids_dst` (output scatter), and writes
+  `expert_bounds[expert] = prefix start`, `expert_bounds[n_experts] = total`.
+  This **is** count -> cumsum -> permute; `expert_bounds` is the analogue of
+  vLLM's `num_tokens_post_padded` boundaries. No `-1` pad today because segments
+  are exact (not block-padded).
+- **[B] persistent grouped FP4 GEMM** = `mmq.cuh` `mul_mat_q` stream-k
+  (kernel ~L3542, `process_tile` ~L3447, launch ~L3943, case-select ~L4055).
+  Single launch, fixed grid (`nsm` CTAs, or `ntiles` when >=90% tile efficiency).
+  Each CTA walks a contiguous `kbc` slice of (expert `zt` via `expert_bounds`,
+  col-tile `jt`, row-tile `it`, k-block) space; the weight row-tile (`mmq_y=128`
+  x K) is loaded once per col-tile in the `process_tile` k-loop; empty col-tiles
+  past `col_diff` are SKIPPED by advancing `kbc += blocks_per_ne00`; a
+  `stream_k_fixup` pass recombines split tiles.
+- **[C] native FP4-MMA expert weights** = `block_fp4_mmq` + `MMQ_MMA_TILE_X_K_FP4`
+  (== Q8_1 tile, skew-pad +4) under `BLACKWELL_MMA_AVAILABLE`;
+  `quantize_mmq_fp4_cuda` quantizes activations to the q8-style y-layout **with
+  the `ids_src1` gather fused** (one pass, no separate row-copy).
+
+Dispatch seam: `ggml-cuda.cu` `ggml_cuda_mul_mat_id()` (~L2632-2790). For mxfp4
+with `ne2`(tokens) > 7, `should_use_mmq()` -> true -> `ggml_cuda_mul_mat_q()`
+(`mmq.cu` id-branch ~L162-225) -> `mm_ids_helper` then ONE
+`mul_mat_q_switch_type`. The per-expert host loop below it is the gated fallback.
+
+(Below npl8, MXFP4 mmid routes through `mmvq` - `MMVQ_MAX_BATCH_SIZE=8`, mmid max
+7 for turing_plus - which is fine for thin batch and out of scope here.)
+
+## What to add (the durable delta, priority order)
+
+### [1] Expert-aware M-tile selection (host-side only, zero new kernel)
+
+In `mul_mat_q_case` / `launch_mul_mat_q`, when `ids != null`, choose `mmq_x` from
+**per-expert density** (~`ne_get_rows / n_active_experts`, derivable cheaply, or
+capped via env) instead of minimizing `ntiles` over aggregate `ncols_max`.
+
+- `mmq_x` is a **compile-time template** (switch 8..128 step 8), so this is a pure
+  host-side SELECTION change - it picks a different already-compiled instantiation.
+  **Zero new kernel. Very low risk, high leverage.** Matches vLLM `BLOCK_SIZE_M`.
+- Doubles as near-term lever-1: env-gated `LLAMA_MOE_MMQ_X` cap at the knee.
+- GB10-aligned: smaller `mmq_x` -> smaller shared mem -> higher occupancy, and at
+  tokens/expert <= 8 (one col-tile/expert) it costs no extra weight read.
+
+This is the single highest-leverage change and the seed of the durable port.
+
+### [2] Block-padded moe_align (the moe_align_block_size port proper)
+
+Extend `mm_ids_helper` to pad each expert segment up to a multiple of the chosen
+block: write a sentinel (`-1`) `ids_dst` for pad lanes, put `expert_bounds` on
+block boundaries. Then every col-tile is **full**, which:
+
+- drops the `need_check` masking + per-expert partial-tail MMA,
+- makes the stream-k `kbc` space exact (no skipped tiles, cleaner persistent
+  schedule), removing the `col_diff` skip branch.
+
+Medium risk: touches the scatter, the `col_diff`/`need_check` logic, and the
+`write_back` masking (pad rows must not write output). This is the proper
+`moe_align_block_size` analogue and the durable second step.
+
+### [3] Bespoke masked-grouped FP4 kernel - ONLY if [1]+[2] insufficient
+
+A CUTLASS/DeepGEMM-style masked-grouped FP4 kernel. **Largest risk, likely
+unnecessary** given [B] is already a persistent stream-k grouped GEMM. Listed for
+completeness; do not start without [1]+[2] measured as insufficient.
+
+## Integration into ggml_mul_mat_id (dispatch seam + gated fallback)
+
+- The seam is unchanged: `ggml_cuda_mul_mat_id()` -> `should_use_mmq()` ->
+  `ggml_cuda_mul_mat_q()`. [1] and [2] live entirely inside the mmq id-branch
+  (`mmq.cu` ~L162-225) and its callees (`mmq.cuh` selection/launch, `mmid.cu`
+  scatter). No change to the host dispatch decision.
+- **Gated fallback preserved**: the existing per-expert host loop
+  (`should_use_mmq()==false` path) stays as-is for f16/bf16 experts and
+  non-Blackwell GPUs. The new selection only fires on the grouped path.
+- **Env gates** (off = exact current behavior):
+  - `LLAMA_MOE_MMQ_X=<8..128>` - cap/override the token tile for the id-path
+    (lever-1 + [1] manual knob).
+  - `LLAMA_MOE_BLOCK_ALIGN=0|1` - enable block-padded scatter ([2]).
+  Default both off until parity + throughput proven, then flip [1]'s
+  auto-selection on by default.
+
+## Correctness / parity gate
+
+Primary: `tests/test-backend-ops.cpp` `test_mul_mat_id` (~L4181). The CPU
+reference is **deterministic** - the op test must be **bit-exact**.
+
+- Sweep `type_a` in {`MXFP4`, `NVFP4`}, `type_b = F32`, `n_mats = 128`,
+  `n_expert_used = 8`, `n_tokens` in {8, 32, 64, 128} (the decode-density band).
+- **Add ragged small-M shapes** to the harness if absent (n_tokens not a multiple
+  of mmq_x; experts with 0/1/2 tokens) - these are exactly where [1]/[2] change
+  tile geometry and where block-pad masking can leak.
+- Pass criterion: new `mmq_x` selection and padded-align produce dst **identical**
+  to current op-test output (op test is exact; the GB10 CUDA greedy-decode
+  non-determinism band applies only to end-to-end, never to the op test).
+- End-to-end sanity: `llama-batched-bench` on `~/bench/qwen3coder-mxfp4.gguf`,
+  `-fa on -npp 128 -ntg 128`, npl 8/32/64/128/256; confirm `S_TG` stays monotonic
+  and `S_PP` flat ~3050-3090. Verify greedy-decode output within the documented
+  CUDA batch-shape non-determinism band (CPU is the deterministic oracle).
+
+Bench/parity scripts stay **dev-tree-only** (`~/llama-paged-dev/benches/`).
+
+## Phased plan, expected payoff, risk per phase
+
+| Phase | Work | Expected payoff | Risk |
+|-------|------|-----------------|------|
+| **P0** harness | Add ragged small-M + MXFP4/NVFP4 mmid shapes to `test_mul_mat_id`; capture current bit-exact baseline + the monotonic batched-bench curve as the reference. | None (gate). Locks correctness + the 85->1771 t/s baseline so any regression is caught. | Low. |
+| **P1** sort op | Confirm `mm_ids_helper` is the moe_align; if [2] is pursued, prototype the block-pad scatter behind `LLAMA_MOE_BLOCK_ALIGN`. | Enables exact stream-k schedule; removes `need_check` masking (P3 payoff). | Medium (scatter + write-back masking). |
+| **P2** grouped GEMM ([1]) | Expert-aware `mmq_x` selection in `mul_mat_q_case`/launch, `LLAMA_MOE_MMQ_X` gate. | The headline: reclaim the 3-6% M-tile fill waste at npl64-128. Modeled as removing wasted MMA M-throughput on every activated expert; net throughput up at high batch with no extra weight read. | **Low** (host-side template selection, no new kernel). |
+| **P3** tune ([2] + fixup) | Land block-padded align; tune `mmq_x` per density, profile stream-k `fixup` overhead and `mmq_x`/`mmq_y` tile choice with nsys on the grouped `mul_mat_q<MXFP4>` kernel. | Remove per-expert partial-tail MMA; tighten the persistent schedule. Diminishing vs P2; this is pure micro-efficiency toward/past vLLM's saturated grouped-GEMM. | Medium-high (kernel masking paths). |
+
+**Honest payoff framing:** the npl128 "cliff" is already gone on HEAD, so there is
+no broken path to unlock. The durable win is **matching vLLM's saturated
+grouped-GEMM M-tiling** (small per-expert block) and erasing the dense-GEMM-tuned
+M-tile mismatch - a micro-efficiency gain at large effective batch, not a
+step-change. vLLM 0.23.0 cannot even serve this model on GB10 (bf16 MoE-warmup
+hang + hard reboot; GGUF loader can't map fused qwen3moe experts), and llama
+already uses the same sorted-grouped-GEMM algorithm, so structural parity is
+**already met**; this closes the residual kernel micro-gap.
+
+## The biggest risk: the GB10 W4A16 occupancy wall
+
+The dominant risk is **repeating the W4A16 dead-end** that hit only ~9 TFLOPS /
+178 t/s on GB10. GB10 is **occupancy-dominated**: deep `cp.async` pipelines and
+XOR-swizzle shared layouts **collapse occupancy** there. Any P3 kernel work MUST:
+
+- keep **small shared mem + high occupancy** (do NOT add deep `cp.async` stages
+  or XOR-swizzle - they are exactly what killed W4A16);
+- preserve the **skew-pad (+4)** tile layout already in `MMQ_MMA_TILE_X_K_FP4`;
+- stay on the **FP4-MMA path** (`block_fp4_mmq`), the only path that hits Blackwell
+  FP4 = 2x INT8/BF16 rate;
+- respect the ~273 GB/s LPDDR5X weight-read floor (dense decode is already at it;
+  MoE wins come from occupancy/tile fit, not bandwidth).
+
+Smaller `mmq_x` ([1]) is **strictly consistent** with these lessons: it reduces
+shared-mem footprint, raises occupancy, and at tokens/expert <= 8 adds no weight
+re-read. So the low-risk lever ([1]) is also the one most aligned with what GB10
+rewards - which is why it leads the plan and [3] is gated behind it.
+
+## Commit / hygiene
+
+Scope doc only (this file). No engine change committed in this workflow. Bench and
+parity scripts are dev-tree-only. Commit with `git -s`, trailer
+`Assisted-by: Claude:opus-4.8 [Claude Code]`, no `Co-Authored-By`, no em-dashes.
+Do not push (human pushes). When [1]/[2] are implemented they mirror to
+`backend/cpp/llama-cpp/patches/paged/0014-*` (next free slot).