feat(paged): add moe small-m mmq candidate trace

Assisted-by: Codex:gpt-5
2026-07-02 20:37:03 -04:00 · 2026-07-01 05:08:31 +00:00
parent e189e5a4ca
commit 70a4c31f36
8 changed files with 461 additions and 10 deletions
--- a/backend/cpp/llama-cpp-localai-paged/README.md
+++ b/backend/cpp/llama-cpp-localai-paged/README.md
@@ -87,7 +87,7 @@ orthogonal to the paged allocator.

 ---

-## 3. Patch series (0001-0057)
+## 3. Patch series (0001-0058)

 Source-only patches, with intentional numbering gaps (e.g. 0005, 0027). The
 decode-serving graph-reuse levers are 0040-0041. "Bit-exact" = greedy md5 /
@@ -215,6 +215,7 @@ These are the dominant decode levers on the Qwen3.6 hybrid models. All bit-exact
 | 0055 | **Trace speculative batch shapes** - adds default-off `LLAMA_SPEC_SHAPE_TRACE=1` server logs around `server_slot::handle_last_sampled_token()`, reporting normal decode rows and MTP verification `K + 1` rows (`draft`, `outputs`, `spec_i_first`, `spec_i_last`). This is instrumentation only for Phase 18 shape-entropy measurement before any scheduler experiment. | yes (env unset is silent; DGX gates after patch: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`) |
 | 0056 | **Trace MoE MMQ batch shapes** - adds default-off `LLAMA_MOE_MMQ_SHAPE_TRACE=<n>` logs from the grouped-MMQ host selector, reporting routed assignment count, estimated active experts, density, selected `mmq_x`, `mmq_y`, and stream-k. This is evidence-only instrumentation for sizing structural grouped-MMQ work after Phase 28 rejected launch-bounds/row-tile knobs. | yes (env unset and trace-enabled gates both green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`; trace cap verified with 4 lines) |
 | 0057 | **Trace MoE MMQ launch shapes** - extends `LLAMA_MOE_MMQ_SHAPE_TRACE=<n>` with bounded `[LLAMA_MOE_MMQ_LAUNCH]` lines from `launch_mul_mat_q`, recording actual `ntiles_dst`, `stream_k_blocks`, tile efficiency, `fixup`, `ntx/nty/ntzw`, and compiled `mmq_x/mmq_y`. This is evidence-only instrumentation to distinguish real stream-k/fixup overhead from small-M kernel-shape cost. | yes (default-off, trace-enabled, and post-serving gates green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`; Phase 31 n128 trace showed decode and prefill `fixup=0`, `stream_k_blocks == ntiles_dst`) |
+| 0058 | **Trace MoE small-M MMQ candidates** - adds `LLAMA_MOE_MMQ_SMALL_M_TRACE=<n>` and a host-only classifier for decode-like low-density grouped-MMQ shapes (`ncols_max <= 128`, density `<=4`, `mmq_x_best <=64`). It only counts candidate calls for the next structural tile-policy A/B; no numeric branch is added. | yes (default-off, trace-enabled, and post-serving gates green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`; Phase 32 n128 trace found 4096 candidates, mostly `mmq_x_best=64/48`) |

 > **Dropped: patch 0026 (hybrid per-head bf16 SSM state, `ssm_bf16_tau`).** Once
 > the decode fusions (0028 recurrent-state gather-fusion + 0029 block-table cache)
@@ -657,3 +658,11 @@ trace-enabled, and post-serving gates stayed stable: MoE `8cb0ce23`, dense
 `4800/4800` and prefill-like `4920/4920` launch lines with `fixup=0` and
 `stream_k_blocks == ntiles_dst`, rejecting a no-fixup/no-stream-k shortcut for
 this workload.
+
+Phase 32 added the small-M classifier trace as patch `0058`
+(`/home/mudler/bench/phase32_small_m_classifier/20260701_070127`). Default-off,
+trace-enabled, and post-serving gates stayed stable: MoE `8cb0ce23`, dense
+`5951a5b4`, `MUL_MAT_ID 806/806`. The n128 serving trace found 4096 small-M
+candidate calls: `mmq_x_best=64` 1800, `48` 1096, `40` 360, `32` 360, `16`
+360, `24` 120. This justifies Phase 33 as a default-off tile-policy A/B
+(`mmq_x=16`, possibly `8`) rather than a broad kernel rewrite.
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -1907,3 +1907,51 @@ Decision:
 - The remaining grouped-MMQ work is structural small-M kernel work, not launch
  overhead. A follow-up should target the decode-like `mmq_x <= 64`, low-density
  kernel shape directly and keep the prefill `mmq_x=128` path separate.
+
+## Phase 32 Small-M MoE MMQ Candidate Classifier
+
+Phase 32 added patch `0058`, a default-off small-M candidate trace. It does not
+change tile selection or launch behavior; it only logs
+`[LLAMA_MOE_MMQ_SMALL_M]` lines when the grouped-MMQ selector has produced a
+decode-like low-density MoE shape.
+
+Artifact:
+
+- `/home/mudler/bench/phase32_small_m_classifier/20260701_070127`
+
+Run:
+
+- Fork commit: `/home/mudler/_git/llama.cpp` `2a9964d29`
+- DGX mirror commit: `dgx:~/llama-phase6-source` `024f494d0`
+- Env: `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 LLAMA_MOE_MMQ_SMALL_M_TRACE=4096`
+- Workload: h2h `n=128`, `PTOK=128`, `GEN=64`
+- Throughput while tracing: `decode_agg_tps=689.0`, `agg_tps=343.9`,
+  `prefill_tps=1566.5`, `TTFT mean=7849.0 ms`
+
+Candidate summary:
+
+| metric | notable values |
+|--------|----------------|
+| total candidates | 4096 |
+| `mmq_x_best` | `64`: 1800, `48`: 1096, `40`: 360, `32`: 360, `16`: 360, `24`: 120 |
+| density | `4`: 1440, `3`: 1336, `1`: 840, `2`: 480 |
+| `ncols_max` | `84`: 600, `128`: 360, `70`: 240, `12`: 240, `97`: 240, `126`: 240 |
+
+Gates:
+
+| check | status | actual |
+|-------|--------|--------|
+| default-off MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
+| default-off dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
+| trace-enabled MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
+| trace-enabled dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
+| post-serving MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
+| post-serving dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
+| `MUL_MAT_ID` | ok | `806/806` in all three gate runs |
+
+Decision:
+
+- There is enough live candidate coverage to justify a default-off tile-policy
+  A/B in Phase 33.
+- Start with a small-M MoE-only `mmq_x=16` cap, and consider `8` only if it
+  compiles and preserves the existing NVFP4 tile invariants.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -419,6 +419,19 @@ with `fixup=0` and `stream_k_blocks == ntiles_dst`. Do not pursue a
 no-fixup/no-stream-k shortcut for this workload; the remaining grouped-MMQ work
 is structural small-M kernel work.

+Phase 32 added patch `0058` for default-off small-M grouped-MMQ candidate
+tracing. Artifact: `/home/mudler/bench/phase32_small_m_classifier/20260701_070127`.
+Fork commit: `2a9964d29 feat(cuda): trace moe small-m mmq candidates`; DGX
+mirror commit: `024f494d0`. The trace adds `[LLAMA_MOE_MMQ_SMALL_M]` lines
+under `LLAMA_MOE_MMQ_SMALL_M_TRACE=<n>` for decode-like low-density grouped-MMQ
+MoE calls (`ncols_max <= 128`, density `<=4`, `mmq_x_best <=64`). Default-off,
+trace-enabled, and post-serving gates stayed green: MoE
+`8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID` `806/806`. The n128 serving
+trace found 4096 candidate calls, mostly `mmq_x_best=64` (1800) and `48`
+(1096). Phase 33 should A/B a default-off small-M tile policy starting at
+`mmq_x=16`.
+
 ---

 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -468,15 +481,15 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 ## 7. KEY FILE / ARTIFACT INDEX

 ### Fork (canonical source of truth)
- Local canonical fork: `/home/mudler/_git/llama.cpp`, branch **`localai-paged`**, HEAD `c78e537b56e3446f8aa645c6700aacf263639bd8` ("trace moe mmq launch shapes", patch `0057`).
- DGX current clean mirror/build tree: `dgx:~/llama-phase6-source`, HEAD `8b75905e9` with the Phase 31 launch-trace patch applied and committed; Phase 20/26/27 artifacts still record their historical source hashes.
+- Local canonical fork: `/home/mudler/_git/llama.cpp`, branch **`localai-paged`**, HEAD `2a9964d290a543d14db972d8d2927ee9d2974f7e` ("trace moe small-m mmq candidates", patch `0058`).
+- DGX current clean mirror/build tree: `dgx:~/llama-phase6-source`, HEAD `024f494d0` with the Phase 32 small-M classifier patch applied and committed; Phase 20/26/27 artifacts still record their historical source hashes.
 - Historical DGX dev tree: `dgx:~/llama-paged-dev`, branch **`paged`**, HEAD `a7d439e8ce6990eb09721223c975da4e49d8d136` ("GDN CONFIG C (M8) - bf16 Kc/Qc"). It is an old experimental tree and must not be treated as canonical.

 ### LocalAI worktree
 - Path: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention`, branch `worktree-feat+paged-attention` (currently 246 ahead, 31 behind `origin/master`; recompute before reporting).
 - Backend dir: `backend/cpp/llama-cpp-localai-paged/` (`Makefile` thin wrapper, `package.sh`, `run.sh`, `README.md` ~44 KB canonical, `docs/`, `patches/paged/`).
 - `docs/`: `VLLM_PARITY_FINAL.md` (authoritative record), `VLLM_PARITY_LEVER_MAP.md` (working brainstorm, profile-validated section), `DECODE_SERVING_SCOPE.md`, `PREFILL_GEMM_SCOPE.md`, `PREFILL_GEMM_RESULTS.md`, `TENSORCORE_GDN_SCOPE.md`, `TENSORCORE_GDN_BUILD_PLAN.md`, `ACCELERATOR_PORTING_SCOPE.md`, `UPSTREAM_LAYER2_SCOPE.md`, `LOCALAI_LLAMACPP_BACKEND_PLAN.md`, `PAGED_BITEXACT_NOTE.md`, `PATCH_MAINTENANCE.md`, `final_benchmark.csv`, `paged-burst-bench.cpp`, `paged-reclaim-unit.cpp`, 3 PNGs, and this `PARITY_HANDOFF.md`.
- `patches/paged/`: **48** `.patch` files spanning 0001-0057 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata/shape/padding is 0048-0050; MoE safety tests are 0051-0053; MTP backend-sampling safety is 0054; speculative shape trace is 0055; MoE MMQ selector/launch traces are 0056-0057.
+- `patches/paged/`: **49** `.patch` files spanning 0001-0058 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata/shape/padding is 0048-0050; MoE safety tests are 0051-0053; MTP backend-sampling safety is 0054; speculative shape trace is 0055; MoE MMQ selector/launch/candidate traces are 0056-0058.

 ### Bench artifacts (DGX)
 - `~/bench/COMBINED_DEFINITIVE.txt` (+ `.log`, `.done`, `combined_definitive.sh`, `combined_definitive.out`) - historical same-session both-engine run.
@@ -490,6 +503,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - `~/bench/phase29_mmq_shape_trace/20260701_042428` - default-off MoE MMQ shape trace patch `0056`; CUDA build plus default/trace md5 gates green.
 - `~/bench/phase30_mmq_shape_serving/20260701_043300` - live n128 serving MMQ shape distribution from patch `0056`; post-run md5/op gates green.
 - `~/bench/phase31_mmq_launch_trace/20260701_064424` - default-off MoE MMQ launch trace patch `0057`; default/trace/post-serving md5 gates green; n128 launch trace rejects stream-k/fixup shortcut (`fixup=0`, `stream_k_blocks == ntiles_dst`).
+- `~/bench/phase32_small_m_classifier/20260701_070127` - default-off MoE MMQ small-M classifier patch `0058`; default/trace/post-serving md5 gates green; n128 trace found 4096 candidate calls.
 - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
@@ -502,8 +516,8 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual

 ### Discrepancies to flag / resolve (carried verbatim from the gather, including UNVERIFIED labels)
 1. **Pin prose reconciled in this worktree.** Makefile line 52 `LLAMA_VERSION?=0ed235ea2c17a19fc8238668653946721ed136fd` is authoritative and matches the local fork merge-base. Hard rule: the paged pin must equal the stock `llama-cpp` pin (shared `grpc-server.cpp`); a bump to `c299a92c` once broke the grpc-server link despite being bit-exact and was reverted. Trust the Makefile when building.
-2. **Current fork/mirror are clean and verified.** Local fork HEAD is `c78e537b5`, DGX clean mirror HEAD is `8b75905e9`, and Phase 31 should be treated as the current patch-series tip. The old `llama-paged-dev` tree is historical only.
-3. **Worktree patch series is tracked through 0057.** The only expected unrelated untracked path in this worktree is `.claude/`.
+2. **Current fork/mirror are clean and verified.** Local fork HEAD is `2a9964d29`, DGX clean mirror HEAD is `024f494d0`, and Phase 32 should be treated as the current patch-series tip. The old `llama-paged-dev` tree is historical only.
+3. **Worktree patch series is tracked through 0058.** The only expected unrelated untracked path in this worktree is `.claude/`.
 4. **`sm_121a` is not in the worktree build files** - it lives only in the DGX experimental build scripts (`gdn_cc.sh`, `gdn_bv_build.sh`, `paged-build.sh`); mainline uses arch `121`. **UNVERIFIED** whether the shipped CI Dockerfile build path injects `121a` for the FP4-MMA kernels (`Dockerfile.llama-cpp-localai-paged` does not hardcode a CUDA arch).
 5. **The `0921716...` paged-MoE md5 open item.** `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=0921716cd0582b5d15af8c362b811d00` for MoE, but a full doc/patch/`git log -S` grep of the worktree found **no** occurrence of `0921716...` in any committed source; the committed canonical paged-MoE gate is `8cb0ce23`. Treat this as **unreconciled**: the documented, KL-validated paged-MoE gate remains `8cb0ce23`, and any paged-MoE divergence (including `0921716`) must be KL-validated against the f16 reference before being accepted as benign, never on assertion alone. The `0921716` value is **UNVERIFIED** as a sanctioned gate; do not adopt it as canonical without re-running the KL gate. The **dense** run is symmetric: `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=ecfe924dee6c5622c149f419ff2a6481` for dense, which likewise differs from the canonical dense gate `5951a5b4`. Both CDEF `PAGED_GATE_MD5` values come from the `combined_definitive.sh` harness's own gate command, NOT the canonical bit-exact gate command in section 3.3, which is why they diverge from the committed `8cb0ce23` / `5951a5b4`; neither is a sanctioned gate and both must be KL-validated before being treated as benign.

--- a/backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md
@@ -57,18 +57,18 @@ everywhere without ever touching the stock `llama-cpp` source tree.

 ## Latest mirror check

-Phase 31 re-verified the mirror invariant after adding patch `0057`:
+Phase 32 re-verified the mirror invariant after adding patch `0058`:

 ```text
 base=0ed235ea2c17a19fc8238668653946721ed136fd
-applied_tree=4eae628e4ba6f2defa14a19d19f7e4abef9a2647
-fork_tree=4eae628e4ba6f2defa14a19d19f7e4abef9a2647
+applied_tree=de1bdd1892ab87aee947ec19c5efed8f53b93d40
+fork_tree=de1bdd1892ab87aee947ec19c5efed8f53b93d40
 ```

 The check used a fresh worktree at `LLAMA_VERSION`, applied every
 `patches/paged/0*.patch` with strict `git apply`, staged the result, and compared
 `git write-tree` to canonical fork branch `localai-paged` at
-`c78e537b5 feat(cuda): trace moe mmq launch shapes`.
+`2a9964d29 feat(cuda): trace moe small-m mmq candidates`.

 ## Status

--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -825,6 +825,31 @@ n128 serving workload. The launch code is already choosing conventional
 stream-k tiling with no fixup; the remaining gap is the small-M grouped-MMQ
 kernel shape itself, not launch/fixup overhead.

+### Phase 32 small-M MMQ candidate classifier
+
+Phase 32 added patch `0058`, a default-off
+`LLAMA_MOE_MMQ_SMALL_M_TRACE=<n>` classifier for decode-like low-density MoE
+grouped-MMQ calls. Artifact:
+`/home/mudler/bench/phase32_small_m_classifier/20260701_070127`.
+
+The default-off, trace-enabled, and post-serving gates all stayed bit-exact:
+MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
+
+Live n128 serving with `LLAMA_MOE_MMQ_SMALL_M_TRACE=4096` found 4096 candidate
+calls:
+
+| metric | notable values |
+|--------|----------------|
+| `mmq_x_best` | `64`: 1800, `48`: 1096, `40`: 360, `32`: 360, `16`: 360, `24`: 120 |
+| density | `4`: 1440, `3`: 1336, `1`: 840, `2`: 480 |
+
+Lever implication: Phase 33 should A/B a default-off small-M tile policy, first
+forcing candidate calls to `mmq_x=16` and only then trying `8` if it compiles
+and keeps the NVFP4 tile invariants. This matches the vLLM/Marlin lesson that
+low-density routed expert rows want smaller M blocks, without porting Marlin,
+Triton, TMA, tcgen05, or layout repack machinery.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

 ### Phase 10 GDN C32 slab update
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0058-feat-cuda-trace-moe-small-m-mmq-candidates.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0058-feat-cuda-trace-moe-small-m-mmq-candidates.patch
@@ -0,0 +1,182 @@
+From 2a9964d290a543d14db972d8d2927ee9d2974f7e Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Wed, 1 Jul 2026 05:05:17 +0000
+Subject: [PATCH] feat(cuda): trace moe small-m mmq candidates
+
+Assisted-by: Codex:gpt-5
+---
+ ggml/src/ggml-cuda/mmq-shape-trace.h | 58 ++++++++++++++++++++++++++++
+ ggml/src/ggml-cuda/mmq.cuh           | 27 +++++++++++++
+ tests/test-cuda-mmq-shape-trace.cpp  | 35 +++++++++++++++++
+ 3 files changed, 120 insertions(+)
+
+diff --git a/ggml/src/ggml-cuda/mmq-shape-trace.h b/ggml/src/ggml-cuda/mmq-shape-trace.h
+index 98bc21f7f..47453d91f 100644
+--- a/ggml/src/ggml-cuda/mmq-shape-trace.h
+++ b/ggml/src/ggml-cuda/mmq-shape-trace.h
+@@ -37,6 +37,18 @@ struct ggml_cuda_mmq_launch_shape {
+     bool fixup_needed;
+ };
+ 
+struct ggml_cuda_mmq_small_m_shape {
+    bool is_moe;
+    int64_t ncols_dst;
+    int64_t nchannels_x;
+    int64_t ncols_max;
+    int64_t n_active_est;
+    int64_t density;
+    int mmq_x_best;
+    bool use_stream_k;
+    bool is_candidate;
+};
+
+ static inline ggml_cuda_mmq_shape ggml_cuda_mmq_shape_make(
+         const int type, const bool is_moe, const int64_t ncols_dst, const int64_t nchannels_x,
+         const int64_t ncols_max, const int mmq_x_max, const int mmq_x_lim, const int mmq_x_best,
+@@ -64,6 +76,36 @@ static inline ggml_cuda_mmq_shape ggml_cuda_mmq_shape_make(
+     };
+ }
+ 
+static inline ggml_cuda_mmq_small_m_shape ggml_cuda_mmq_small_m_shape_make(
+        const bool is_moe, const int64_t ncols_dst, const int64_t nchannels_x,
+        const int64_t ncols_max, const int mmq_x_best, const bool use_stream_k) {
+    int64_t n_active_est = 0;
+    int64_t density = 0;
+    if (is_moe && ncols_dst > 0 && nchannels_x > 0) {
+        n_active_est = ncols_dst < nchannels_x ? ncols_dst : nchannels_x;
+        density = (ncols_dst + n_active_est - 1) / n_active_est;
+    }
+
+    const bool is_candidate =
+        is_moe &&
+        use_stream_k &&
+        ncols_max > 0 && ncols_max <= 128 &&
+        density > 0 && density <= 4 &&
+        mmq_x_best > 0 && mmq_x_best <= 64;
+
+    return {
+        is_moe,
+        ncols_dst,
+        nchannels_x,
+        ncols_max,
+        n_active_est,
+        density,
+        mmq_x_best,
+        use_stream_k,
+        is_candidate,
+    };
+}
+
+ static inline ggml_cuda_mmq_launch_shape ggml_cuda_mmq_launch_shape_make(
+         const int type, const bool is_moe, const int64_t ncols_dst, const int64_t ncols_max,
+         const int mmq_x, const int mmq_y, const int ntx, const int nty, const int ntzw,
+@@ -129,3 +171,19 @@ static inline int ggml_cuda_mmq_launch_shape_format(
+         shape.stream_k_blocks,
+         shape.fixup_needed ? 1 : 0);
+ }
+
+static inline int ggml_cuda_mmq_small_m_shape_format(
+        char * buf, const size_t size, const ggml_cuda_mmq_small_m_shape & shape) {
+    return std::snprintf(buf, size,
+        "candidate=%d moe=%d ncols_dst=%lld nchannels_x=%lld ncols_max=%lld "
+        "n_active_est=%lld density=%lld mmq_x_best=%d stream_k=%d",
+        shape.is_candidate ? 1 : 0,
+        shape.is_moe ? 1 : 0,
+        (long long) shape.ncols_dst,
+        (long long) shape.nchannels_x,
+        (long long) shape.ncols_max,
+        (long long) shape.n_active_est,
+        (long long) shape.density,
+        shape.mmq_x_best,
+        shape.use_stream_k ? 1 : 0);
+}
+diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh
+index 34002edf7..25ead9e7b 100644
+--- a/ggml/src/ggml-cuda/mmq.cuh
+++ b/ggml/src/ggml-cuda/mmq.cuh
+@@ -4007,6 +4007,24 @@ static inline bool ggml_cuda_moe_mmq_trace_take(std::atomic<int> & counter) {
+     return trace_index >= 0 && trace_index < trace_limit;
+ }
+ 
+static inline int ggml_cuda_moe_mmq_small_m_trace_limit() {
+    static const int limit = []() -> int {
+        const char * s = getenv("LLAMA_MOE_MMQ_SMALL_M_TRACE");
+        if (s == nullptr || strcmp(s, "0") == 0) {
+            return 0;
+        }
+        const int parsed = atoi(s);
+        return parsed > 0 ? parsed : 256;
+    }();
+    return limit;
+}
+
+static inline bool ggml_cuda_moe_mmq_small_m_trace_take(std::atomic<int> & counter) {
+    const int trace_limit = ggml_cuda_moe_mmq_small_m_trace_limit();
+    const int trace_index = trace_limit > 0 ? counter.fetch_add(1, std::memory_order_relaxed) : trace_limit;
+    return trace_index >= 0 && trace_index < trace_limit;
+}
+
+ template <ggml_type type, int mmq_x>
+ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
+     const int id = ggml_cuda_get_device();
+@@ -4294,6 +4312,15 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
+             ggml_cuda_mmq_shape_format(buf, sizeof(buf), shape);
+             fprintf(stderr, "[LLAMA_MOE_MMQ_SHAPE] %s\n", buf);
+         }
+
+        static std::atomic<int> small_m_trace_count{0};
+        const ggml_cuda_mmq_small_m_shape small_m = ggml_cuda_mmq_small_m_shape_make(
+            true, args.ncols_dst, args.nchannels_x, args.ncols_max, mmq_x_best, args.use_stream_k);
+        if (small_m.is_candidate && ggml_cuda_moe_mmq_small_m_trace_take(small_m_trace_count)) {
+            char buf[256];
+            ggml_cuda_mmq_small_m_shape_format(buf, sizeof(buf), small_m);
+            fprintf(stderr, "[LLAMA_MOE_MMQ_SMALL_M] %s\n", buf);
+        }
+     }
+ 
+     switch (mmq_x_best) {
+diff --git a/tests/test-cuda-mmq-shape-trace.cpp b/tests/test-cuda-mmq-shape-trace.cpp
+index 86ee15e02..9f36ce1a1 100644
+--- a/tests/test-cuda-mmq-shape-trace.cpp
+++ b/tests/test-cuda-mmq-shape-trace.cpp
+@@ -73,5 +73,40 @@ int main() {
+     require(std::strstr(buf, "stream_k_blocks=16") != nullptr, "launch trace includes actual stream-k block count");
+     require(std::strstr(buf, "fixup=1") != nullptr, "launch trace includes fixup flag");
+ 
+    const ggml_cuda_mmq_small_m_shape small_m = ggml_cuda_mmq_small_m_shape_make(
+        /* is_moe */ true,
+        /* ncols_dst */ 1024,
+        /* nchannels_x */ 256,
+        /* ncols_max */ 128,
+        /* mmq_x_best */ 64,
+        /* use_stream_k */ true);
+
+    require(small_m.is_candidate, "decode-like low-density MoE shape is a small-M candidate");
+    require(small_m.n_active_est == 256, "small-M active estimate is capped by expert count");
+    require(small_m.density == 4, "small-M density is ceil(assignments / active experts)");
+
+    require(!ggml_cuda_mmq_small_m_shape_make(
+        /* is_moe */ false, 1024, 256, 128, 64, true).is_candidate,
+        "dense shape is excluded");
+    require(!ggml_cuda_mmq_small_m_shape_make(
+        /* is_moe */ true, 4096, 256, 512, 128, true).is_candidate,
+        "prefill-like shape is excluded");
+    require(!ggml_cuda_mmq_small_m_shape_make(
+        /* is_moe */ true, 4096, 256, 128, 64, true).is_candidate,
+        "high-density shape is excluded");
+    require(!ggml_cuda_mmq_small_m_shape_make(
+        /* is_moe */ true, 1024, 256, 128, 128, true).is_candidate,
+        "large selected tile is excluded");
+    require(!ggml_cuda_mmq_small_m_shape_make(
+        /* is_moe */ true, 1024, 256, 128, 64, false).is_candidate,
+        "non-stream-k shape is excluded");
+
+    const int small_m_n = ggml_cuda_mmq_small_m_shape_format(buf, sizeof(buf), small_m);
+
+    require(small_m_n > 0, "small-M format returns byte count");
+    require(std::strstr(buf, "candidate=1") != nullptr, "small-M trace includes candidate flag");
+    require(std::strstr(buf, "density=4") != nullptr, "small-M trace includes density");
+    require(std::strstr(buf, "mmq_x_best=64") != nullptr, "small-M trace includes selected tile");
+
+     return 0;
+ }
--- a/docs/superpowers/plans/2026-07-01-small-m-mmq-classifier-phase32.md
+++ b/docs/superpowers/plans/2026-07-01-small-m-mmq-classifier-phase32.md
@@ -0,0 +1,72 @@
+# Small-M MMQ Classifier Phase 32 Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Add a default-off, md5-safe classifier/trace for decode-small-M MoE grouped-MMQ candidates before building any alternate numeric kernel.
+
+**Architecture:** Extend the existing host-only MMQ trace helper with a pure small-M predicate and format helper. Wire a bounded `[LLAMA_MOE_MMQ_SMALL_M]` trace in `mul_mat_q_case` after `mmq_x_best` is selected, using a separate env `LLAMA_MOE_MMQ_SMALL_M_TRACE=<n>` so normal shape tracing behavior remains unchanged.
+
+**Tech Stack:** llama.cpp CUDA backend, host-only C++ unit test, LocalAI paged patch series, DGX GB10 md5/op gates.
+
+---
+
+## Files
+
+- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/mmq-shape-trace.h`
+  - Add `ggml_cuda_mmq_small_m_shape`, make/format helpers, and candidate predicate.
+- Modify: `/home/mudler/_git/llama.cpp/tests/test-cuda-mmq-shape-trace.cpp`
+  - Add RED/GREEN assertions for decode-like inclusion and prefill/dense exclusion.
+- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/mmq.cuh`
+  - Add `LLAMA_MOE_MMQ_SMALL_M_TRACE=<n>` parser and bounded trace emission.
+- Create: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0058-feat-cuda-trace-moe-small-m-mmq-candidates.patch`
+- Modify docs: README, GB10 results, lever map, handoff, patch maintenance, and this plan.
+
+## Checklist
+
+- [x] **Step 1: RED host test**
+  - Add test calls to `ggml_cuda_mmq_small_m_shape_make` and assert candidate true for `is_moe=true`, `ncols_dst=1024`, `nchannels_x=256`, `ncols_max=128`, `mmq_x_best=64`, `use_stream_k=true`.
+  - Assert false for dense (`is_moe=false`), prefill (`ncols_max=512`), high density (`ncols_dst=4096`), large tile (`mmq_x_best=128`), and no stream-k.
+  - Run: `cmake --build build --target test-cuda-mmq-shape-trace -j 4`.
+  - Expected: compile failure because the helper does not exist.
+
+- [x] **Step 2: GREEN host helper**
+  - Add helper structs/functions in `mmq-shape-trace.h`.
+  - Run: `cmake --build build --target test-cuda-mmq-shape-trace -j 4 && ./build/bin/test-cuda-mmq-shape-trace`.
+  - Expected: pass.
+
+- [x] **Step 3: Wire default-off trace**
+  - Add `ggml_cuda_moe_mmq_small_m_trace_limit()`.
+  - Emit `[LLAMA_MOE_MMQ_SMALL_M]` only when `args.expert_bounds != nullptr`, helper says candidate, and the trace limit allows it.
+  - No numeric branch or tile change in this patch.
+
+- [x] **Step 4: DGX build and gates**
+  - Build `llama-server`, `llama-completion`, `test-backend-ops`, and `test-cuda-mmq-shape-trace`.
+  - Run default-off gates: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID 806/806`.
+  - Run trace-enabled gates with `EXTRA_ENV=LLAMA_MOE_MMQ_SMALL_M_TRACE=4`; expected same md5/op values and four small-M trace lines from MoE only.
+
+- [x] **Step 5: n128 serving count**
+  - Run h2h n128 with `LLAMA_MOE_MMQ_SMALL_M_TRACE=4096`.
+  - Parse small-M lines and compare count to Phase 30/31 decode-like launch count.
+  - Run post-serving gates.
+
+- [x] **Step 6: Mirror and docs**
+  - Commit fork, generate LocalAI patch `0058`.
+  - Verify strict patch-series tree equals fork tree.
+  - Update docs and mark this checklist complete with artifact path and decision.
+  - Commit LocalAI with `Assisted-by: Codex:gpt-5`.
+
+## Result
+
+- Fork commit: `/home/mudler/_git/llama.cpp` `2a9964d29 feat(cuda): trace moe small-m mmq candidates`.
+- DGX mirror commit: `dgx:~/llama-phase6-source` `024f494d0 feat(cuda): trace moe small-m mmq candidates`.
+- Artifact: `/home/mudler/bench/phase32_small_m_classifier/20260701_070127`.
+- RED verified: `cmake --build build --target test-cuda-mmq-shape-trace -j 4` failed on missing `ggml_cuda_mmq_small_m_shape`.
+- GREEN verified locally: `cmake --build build --target test-cuda-mmq-shape-trace -j 4 && ./build/bin/test-cuda-mmq-shape-trace`.
+- DGX CUDA build verified: `llama-server`, `llama-completion`, `test-backend-ops`, and `test-cuda-mmq-shape-trace`.
+- Default-off, trace-enabled, and post-serving gates all matched MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
+- n128 traced serving: `decode_agg_tps=689.0`, `agg_tps=343.9`, `prefill_tps=1566.5`, `TTFT mean=7849.0 ms`.
+- Small-M candidate trace: `4096` candidate calls in the first serving trace window.
+  - `mmq_x_best`: `64` 1800, `48` 1096, `40` 360, `32` 360, `16` 360, `24` 120.
+  - density: `4` 1440, `3` 1336, `1` 840, `2` 480.
+
+Decision: Phase 33 can A/B a default-off small-M tile policy, with `mmq_x=16` and possibly `8` as the first candidates. The classifier shows enough live candidate coverage to justify an opt-in tile-policy experiment, while preserving the existing MMQ path and md5 gates.
--- a/docs/superpowers/specs/2026-07-01-small-m-mmq-phase32.md
+++ b/docs/superpowers/specs/2026-07-01-small-m-mmq-phase32.md
@@ -0,0 +1,101 @@
+# Small-M MoE MMQ Phase 32 Spec
+
+## Problem
+
+Phase 30 proved n128 serving feeds grouped-MMQ with small decode-like
+per-expert shapes (`ncols_max <= 128`, density `1-4`, selected `mmq_x <= 64`).
+Phase 31 proved the obvious launch-policy shortcut is not the issue: in live
+n128 serving, all traced decode-like and prefill-like launch lines had
+`fixup=0` and `stream_k_blocks == ntiles_dst`.
+
+The remaining grouped-MMQ gap is therefore structural small-M kernel shape:
+the kernel is already launched without fixup overhead, but the work inside each
+expert tile still pays for padded, low-density token columns.
+
+## Constraints
+
+- Preserve default behavior unless an explicit experimental env/build knob is
+  set.
+- Keep the patch stack incremental: add helpers or alternate launch branches
+  instead of rewriting existing MMQ templates.
+- Prefer host-side selection shortcuts and small helper functions over broad
+  template refactors, to reduce upstream conflict risk.
+- Every source change must be gated by:
+  - `test-cuda-mmq-shape-trace` or a new host/unit test for selector behavior.
+  - DGX CUDA build of `llama-server`, `llama-completion`, `test-backend-ops`.
+  - Default-off MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`.
+  - Default-off dense md5 `5951a5b4d624ce891e22ab5fca9bc439`.
+  - `MUL_MAT_ID` `806/806`.
+  - Trace/knob-enabled md5/op gate when the experiment is expected to be
+    numerically identical.
+
+## Rejected By Evidence
+
+- No-fixup/no-stream-k shortcut: Phase 31 n128 serving had decode-like
+  `4800/4800` and prefill-like `4920/4920` launch lines with `fixup=0` and
+  `stream_k_blocks == ntiles_dst`.
+- Build-time MMQ occupancy shortcuts: Phase 28 rejected `GGML_CUDA_FP4_MINBLOCKS=2`
+  as slower and `GGML_CUDA_FP4_MMQ_Y=64` as compile-invalid for NVFP4 writeback.
+
+## Candidate Directions
+
+### A. Exact Expert Histogram Trace
+
+Add a default-off diagnostic that records exact per-expert segment lengths after
+`expert_bounds` is available. This requires care because device-to-host readback
+can synchronize the stream and perturb serving; it should run only in a
+standalone diagnostic path, never in normal serving gates.
+
+Use this only if selector estimates are insufficient for designing the next
+kernel.
+
+### B. Decode-Only Alternative Small-M Kernel Hook
+
+Add an opt-in branch for grouped MoE NVFP4 decode-like shapes:
+
+- `args.expert_bounds != nullptr`
+- `type == GGML_TYPE_NVFP4`
+- `args.ncols_max <= 128`
+- estimated density `<= 4`
+- selected `mmq_x <= 64`
+
+The first implementation should be a compile-time skeleton or dispatch counter,
+not a numeric kernel, unless the exact implementation can be tested against
+`MUL_MAT_ID` in isolation. The gate is a new `test-backend-ops` case covering
+ragged MoE decode shapes before serving A/B.
+
+### C. W4A16 / Marlin-Style Decode Probe
+
+Re-use the existing W4A16 scaffolding only as a separately gated probe. Prior
+decode W4A16 work was rejected as bandwidth-bound, while prefill remains the
+higher-EV W4A16 target. Do not mix this with the small-M MMQ branch unless a
+new in-backend A/B shows decode benefit.
+
+## Recommended Phase 32 Deliverable
+
+Do not jump straight to a large kernel. The next deliverable should be a small,
+default-off dispatch classification patch:
+
+1. Factor the Phase 30/31 decode-like predicate into a host helper.
+2. Add a test proving the helper selects only small-M grouped MoE NVFP4 shapes
+   and excludes prefill.
+3. Add a bounded log/counter prefix such as `[LLAMA_MOE_MMQ_SMALL_M]` under the
+   existing trace knob or a more specific `LLAMA_MOE_MMQ_SMALL_M_TRACE`.
+4. Re-run n128 serving to verify the candidate branch population before any
+   numeric kernel work.
+
+This keeps the next patch additive, md5-safe, and low-conflict while giving a
+hard count for the future structural branch.
+
+## Subagent Findings Folded In
+
+- llama.cpp path: `ggml_cuda_mul_mat_id` routes quantized MoE to grouped MMQ via
+  `ggml_cuda_should_use_mmq`; `mmq_args` carries `expert_bounds`, `ids_dst`,
+  `ncols_dst=ne12*n_expert_used`, `nchannels_x=ne02`, and `ncols_max=ne12`.
+- The tile selector in `mul_mat_q_case` is the correct low-conflict hook:
+  `LLAMA_MOE_MMQ_X`, `LLAMA_MOE_AUTO_TILE`, `LLAMA_MOE_DECODE_TILE`, and
+  `LLAMA_MOE_DENSITY_MAX` already prove this branch can be changed host-side.
+- vLLM's useful GB10-compatible idea is small expert `block_size_m` selection
+  (`8/16` for low-density routed rows), not TMA/tcgen05/Triton/CUTLASS paths.
+- Phase 32 should therefore add a default-off candidate classifier and trace,
+  then use the measured candidate count to decide whether to A/B `mmq_x=8/16`.