diff --git a/backend/cpp/llama-cpp-localai-paged/README.md b/backend/cpp/llama-cpp-localai-paged/README.md
index 97f45deac..808bb326c 100644
--- a/backend/cpp/llama-cpp-localai-paged/README.md
+++ b/backend/cpp/llama-cpp-localai-paged/README.md
@@ -87,7 +87,7 @@ orthogonal to the paged allocator.
 
 ---
 
-## 3. Patch series (0001-0060)
+## 3. Patch series (0001-0061)
 
 Source-only patches, with intentional numbering gaps (e.g. 0005, 0027). The
 decode-serving graph-reuse levers are 0040-0041. "Bit-exact" = greedy md5 /
@@ -218,6 +218,7 @@ These are the dominant decode levers on the Qwen3.6 hybrid models. All bit-exact
 | 0058 | **Trace MoE small-M MMQ candidates** - adds `LLAMA_MOE_MMQ_SMALL_M_TRACE=<n>` and a host-only classifier for decode-like low-density grouped-MMQ shapes (`ncols_max <= 128`, density `<=4`, `mmq_x_best <=64`). It only counts candidate calls for the next structural tile-policy A/B; no numeric branch is added. | yes (default-off, trace-enabled, and post-serving gates green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`; Phase 32 n128 trace found 4096 candidates, mostly `mmq_x_best=64/48`) |
 | 0059 | **Gate MoE small-M MMQ tile policy** - adds default-off `LLAMA_MOE_SMALL_M_TILE=<n>` to cap only classified small-M MoE grouped-MMQ calls. This was used to A/B vLLM-like smaller M blocks without changing default inference. | yes (default-off, tile16, tile8, and post-serving gates green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`; Phase 33 rejected tile16 and tile8 as slower) |
 | 0060 | **Trace MoE MMID dispatch routes** - adds default-off `LLAMA_MOE_MMID_ROUTE_TRACE=<n>` around `MUL_MAT_ID` dispatch, classifying each call as `mmvq`, `mmvf`, grouped `mmq`, `mmf`, or host-sync `fallback`. This is evidence-only instrumentation to resolve whether serving hits the per-expert host-sync fallback. | yes (default-off, trace-enabled, and post-serving gates green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`; Phase 34 n128 trace found `mmq=2776`, `mmvq=1320`, `host_sync=0/4096`) |
+| 0061 | **Trace regular MUL_MAT dispatch routes** - adds default-off `LLAMA_MUL_MAT_ROUTE_TRACE=<n>` around regular `MUL_MAT`, classifying projection-heavy calls as `vec_f`, `mat_f`, `vec_q`, `mmq`, `batched_cublas`, `op_*`, `fp4_prefill`, or `fwht`. This is evidence-only instrumentation for the `bf16-proj` serving bucket. | yes (default-off, trace-enabled, and post-serving gates green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` `806/806`; Phase 35 n128 trace found BF16 routes `mat_f=2485`, `op_cublas=1330`) |
 
 > **Dropped: patch 0026 (hybrid per-head bf16 SSM state, `ssm_bf16_tau`).** Once
 > the decode fusions (0028 recurrent-state gather-fusion + 0029 block-table cache)
@@ -683,3 +684,13 @@ trace-enabled, and post-serving gates stayed stable: MoE `8cb0ce23`, dense
 `mmq ne2=12` (1096), `mmq ne2=18` (480), and `mmvq ne2=8` (360). This refutes
 host-sync fallback as the current n128 `MUL_MAT_ID` problem; follow-up work should
 target grouped-MMQ small-M kernel partitioning or another measured bucket.
+
+Phase 35 added default-off `LLAMA_MUL_MAT_ROUTE_TRACE=<n>` as patch `0061`
+(`/home/mudler/bench/phase35_mul_mat_route_trace/20260701_074359`). Default-off,
+trace-enabled, and post-serving gates stayed stable: MoE `8cb0ce23`, dense
+`5951a5b4`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`. Live n128 serving with
+trace cap 8192 produced route counts: `mat_f=2888`, `op_cublas=2292`,
+`mmq=1328`, `vec_q=1214`, `vec_f=470`. BF16 (`type=30`) dominated the trace
+with `mat_f=2485` and `op_cublas=1330`; top BF16 shapes were `mat_f ne1=12`
+(775), `op_cublas ne1=18` (760), and `mat_f ne1=8` (570). Next projection work
+should trace or optimize the BF16 `op_cublas`/`mat_f` split, not batched cuBLAS.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index af287fade..e36fd74a2 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -2038,3 +2038,69 @@ Decision:
   grouped MMQ above that.
 - Do not scope the next parity phase around avoiding fallback dispatch. Scope it
   around grouped-MMQ small-M kernel partitioning or another measured bucket.
+
+## Phase 35 Regular MUL_MAT Route Trace
+
+Phase 35 added patch `0061`, a default-off `LLAMA_MUL_MAT_ROUTE_TRACE=<n>`
+diagnostic around regular `MUL_MAT` dispatch. It does not alter routing; it logs
+the existing route decision for projection-heavy calls.
+
+Artifact:
+
+- `/home/mudler/bench/phase35_mul_mat_route_trace/20260701_074359`
+
+Run:
+
+- Fork commit: `/home/mudler/_git/llama.cpp` `486c28c63`
+- DGX mirror commit: `dgx:~/llama-phase6-source` `18f7ad005`
+- Env: `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 LLAMA_MUL_MAT_ROUTE_TRACE=8192`
+- Workload: staggered n128 `llama-server`, `GEN=64`
+
+Route summary:
+
+| route | count |
+|-------|-------|
+| `mat_f` | 2888 |
+| `op_cublas` | 2292 |
+| `mmq` | 1328 |
+| `vec_q` | 1214 |
+| `vec_f` | 470 |
+
+Type summary:
+
+| type | meaning | count |
+|------|---------|-------|
+| 30 | BF16 | 3965 |
+| 40 | NVFP4 | 2542 |
+| 0 | F32 | 1685 |
+
+Top BF16 route/shape counts:
+
+| route | shape | count |
+|-------|-------|-------|
+| `mat_f` | `ne1=12 ne11=12 ne12=1 ne13=1` | 775 |
+| `op_cublas` | `ne1=18 ne11=18 ne12=1 ne13=1` | 760 |
+| `mat_f` | `ne1=8 ne11=8 ne12=1 ne13=1` | 570 |
+| `op_cublas` | `ne1=36 ne11=36 ne12=1 ne13=1` | 380 |
+| `mat_f` | `ne1=2 ne11=2 ne12=1 ne13=1` | 380 |
+
+Gates:
+
+| check | status | actual |
+|-------|--------|--------|
+| default-off MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
+| default-off dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
+| trace-enabled MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
+| trace-enabled dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
+| post-serving MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
+| post-serving dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
+| `MUL_MAT` | ok | `1146/1146` in all three gate runs |
+| `MUL_MAT_ID` | ok | `806/806` in all three gate runs |
+
+Decision:
+
+- The first 8192 regular `MUL_MAT` calls in n128 serving are dominated by BF16
+  direct `mat_f` and generic `op_cublas`, not batched cuBLAS.
+- Next projection work should either add a cuBLAS/MMF subroute trace or test a
+  bounded BF16 route policy for the `op_cublas` shapes. Do not chase batched
+  cuBLAS for this measured serving slice.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index 1f8a9664d..bbaf512c0 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -454,6 +454,18 @@ Treat the old current-stack host-sync-fallback concern as refuted for this
 workload; the remaining MoE work is grouped-MMQ small-M efficiency or another
 measured bucket.
 
+Phase 35 added patch `0061`, default-off `LLAMA_MUL_MAT_ROUTE_TRACE=<n>`, to
+classify regular `MUL_MAT` routes for the projection-heavy serving bucket.
+Artifact: `/home/mudler/bench/phase35_mul_mat_route_trace/20260701_074359`.
+Fork commit: `486c28c63 feat(cuda): trace mul mat routes`; DGX mirror commit:
+`18f7ad005`. Default-off, trace-enabled, and post-serving gates stayed green:
+MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID`
+`806/806`. Live n128 serving with trace cap 8192 found `mat_f=2888`,
+`op_cublas=2292`, `mmq=1328`, `vec_q=1214`, `vec_f=470`; BF16 (`type=30`)
+was split `mat_f=2485`, `op_cublas=1330`. Next projection work should target
+BF16 `mat_f`/`op_cublas` subroute evidence or route policy, not batched cuBLAS.
+
 ---
 
 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -503,15 +515,15 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 ## 7. KEY FILE / ARTIFACT INDEX
 
 ### Fork (canonical source of truth)
-- Local canonical fork: `/home/mudler/_git/llama.cpp`, branch **`localai-paged`**, HEAD `6c332094ca2fbb1e3211427c5f919adcaa89c588` ("trace moe mmid routes", patch `0060`).
-- DGX current clean mirror/build tree: `dgx:~/llama-phase6-source`, HEAD `34a256d14` with the Phase 34 MMID route-trace patch applied and committed; Phase 20/26/27 artifacts still record their historical source hashes.
+- Local canonical fork: `/home/mudler/_git/llama.cpp`, branch **`localai-paged`**, HEAD `486c28c63d5297afd06e5a2bdbd4fb89cad749cd` ("trace mul mat routes", patch `0061`).
+- DGX current clean mirror/build tree: `dgx:~/llama-phase6-source`, HEAD `18f7ad005` with the Phase 35 regular MUL_MAT route-trace patch applied and committed; Phase 20/26/27 artifacts still record their historical source hashes.
 - Historical DGX dev tree: `dgx:~/llama-paged-dev`, branch **`paged`**, HEAD `a7d439e8ce6990eb09721223c975da4e49d8d136` ("GDN CONFIG C (M8) - bf16 Kc/Qc"). It is an old experimental tree and must not be treated as canonical.
 
 ### LocalAI worktree
 - Path: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention`, branch `worktree-feat+paged-attention` (currently 246 ahead, 31 behind `origin/master`; recompute before reporting).
 - Backend dir: `backend/cpp/llama-cpp-localai-paged/` (`Makefile` thin wrapper, `package.sh`, `run.sh`, `README.md` ~44 KB canonical, `docs/`, `patches/paged/`).
 - `docs/`: `VLLM_PARITY_FINAL.md` (authoritative record), `VLLM_PARITY_LEVER_MAP.md` (working brainstorm, profile-validated section), `DECODE_SERVING_SCOPE.md`, `PREFILL_GEMM_SCOPE.md`, `PREFILL_GEMM_RESULTS.md`, `TENSORCORE_GDN_SCOPE.md`, `TENSORCORE_GDN_BUILD_PLAN.md`, `ACCELERATOR_PORTING_SCOPE.md`, `UPSTREAM_LAYER2_SCOPE.md`, `LOCALAI_LLAMACPP_BACKEND_PLAN.md`, `PAGED_BITEXACT_NOTE.md`, `PATCH_MAINTENANCE.md`, `final_benchmark.csv`, `paged-burst-bench.cpp`, `paged-reclaim-unit.cpp`, 3 PNGs, and this `PARITY_HANDOFF.md`.
-- `patches/paged/`: **51** `.patch` files spanning 0001-0060 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata/shape/padding is 0048-0050; MoE safety tests are 0051-0053; MTP backend-sampling safety is 0054; speculative shape trace is 0055; MoE MMQ selector/launch/candidate/tile-policy/route instrumentation is 0056-0060.
+- `patches/paged/`: **52** `.patch` files spanning 0001-0061 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata/shape/padding is 0048-0050; MoE safety tests are 0051-0053; MTP backend-sampling safety is 0054; speculative shape trace is 0055; MoE MMQ selector/launch/candidate/tile-policy/route instrumentation is 0056-0060; regular MUL_MAT route instrumentation is 0061.
 
 ### Bench artifacts (DGX)
 - `~/bench/COMBINED_DEFINITIVE.txt` (+ `.log`, `.done`, `combined_definitive.sh`, `combined_definitive.out`) - historical same-session both-engine run.
@@ -528,6 +540,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - `~/bench/phase32_small_m_classifier/20260701_070127` - default-off MoE MMQ small-M classifier patch `0058`; default/trace/post-serving md5 gates green; n128 trace found 4096 candidate calls.
 - `~/bench/phase33_small_m_tile_policy/20260701_071136` - default-off MoE MMQ small-M tile policy patch `0059`; tile16/tile8 md5/op safe but both slower in n128 serving.
 - `~/bench/phase34_mmid_route_trace/20260701_072737` - default-off MoE MMID route trace patch `0060`; default/trace/post-serving md5 gates green; n128 route trace found `mmq=2776`, `mmvq=1320`, `host_sync=0/4096`.
+- `~/bench/phase35_mul_mat_route_trace/20260701_074359` - default-off regular MUL_MAT route trace patch `0061`; default/trace/post-serving md5 gates green; n128 route trace found BF16 `mat_f=2485`, `op_cublas=1330`.
 - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
@@ -540,8 +553,8 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 
 ### Discrepancies to flag / resolve (carried verbatim from the gather, including UNVERIFIED labels)
 1. **Pin prose reconciled in this worktree.** Makefile line 52 `LLAMA_VERSION?=0ed235ea2c17a19fc8238668653946721ed136fd` is authoritative and matches the local fork merge-base. Hard rule: the paged pin must equal the stock `llama-cpp` pin (shared `grpc-server.cpp`); a bump to `c299a92c` once broke the grpc-server link despite being bit-exact and was reverted. Trust the Makefile when building.
-2. **Current fork/mirror are clean and verified.** Local fork HEAD is `6c332094c`, DGX clean mirror HEAD is `34a256d14`, and Phase 34 should be treated as the current patch-series tip. The old `llama-paged-dev` tree is historical only.
-3. **Worktree patch series is tracked through 0060.** The only expected unrelated untracked path in this worktree is `.claude/`.
+2. **Current fork/mirror are clean and verified.** Local fork HEAD is `486c28c63`, DGX clean mirror HEAD is `18f7ad005`, and Phase 35 should be treated as the current patch-series tip. The old `llama-paged-dev` tree is historical only.
+3. **Worktree patch series is tracked through 0061.** The only expected unrelated untracked path in this worktree is `.claude/`.
 4. **`sm_121a` is not in the worktree build files** - it lives only in the DGX experimental build scripts (`gdn_cc.sh`, `gdn_bv_build.sh`, `paged-build.sh`); mainline uses arch `121`. **UNVERIFIED** whether the shipped CI Dockerfile build path injects `121a` for the FP4-MMA kernels (`Dockerfile.llama-cpp-localai-paged` does not hardcode a CUDA arch).
 5. **The `0921716...` paged-MoE md5 open item.** `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=0921716cd0582b5d15af8c362b811d00` for MoE, but a full doc/patch/`git log -S` grep of the worktree found **no** occurrence of `0921716...` in any committed source; the committed canonical paged-MoE gate is `8cb0ce23`. Treat this as **unreconciled**: the documented, KL-validated paged-MoE gate remains `8cb0ce23`, and any paged-MoE divergence (including `0921716`) must be KL-validated against the f16 reference before being accepted as benign, never on assertion alone. The `0921716` value is **UNVERIFIED** as a sanctioned gate; do not adopt it as canonical without re-running the KL gate. The **dense** run is symmetric: `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=ecfe924dee6c5622c149f419ff2a6481` for dense, which likewise differs from the canonical dense gate `5951a5b4`. Both CDEF `PAGED_GATE_MD5` values come from the `combined_definitive.sh` harness's own gate command, NOT the canonical bit-exact gate command in section 3.3, which is why they diverge from the committed `8cb0ce23` / `5951a5b4`; neither is a sanctioned gate and both must be KL-validated before being treated as benign.
 
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md b/backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md
index e50bdda27..13457afa5 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md
@@ -57,18 +57,18 @@ everywhere without ever touching the stock `llama-cpp` source tree.
 
 ## Latest mirror check
 
-Phase 34 re-verified the mirror invariant after adding patch `0060`:
+Phase 35 re-verified the mirror invariant after adding patch `0061`:
 
 ```text
 base=0ed235ea2c17a19fc8238668653946721ed136fd
-applied_tree=433720590dfafbde8cc5b23a80e13f88349ff90f
-fork_tree=433720590dfafbde8cc5b23a80e13f88349ff90f
+applied_tree=305ebb96801822f2132ed9e9c868308b0759c7b9
+fork_tree=305ebb96801822f2132ed9e9c868308b0759c7b9
 ```
 
 The check used a fresh worktree at `LLAMA_VERSION`, applied every
 `patches/paged/0*.patch` with strict `git apply`, staged the result, and compared
 `git write-tree` to canonical fork branch `localai-paged` at
-`6c332094c feat(cuda): trace moe mmid routes`.
+`486c28c63 feat(cuda): trace mul mat routes`.
 
 ## Status
 
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
index d64b71b10..f3ccbe57e 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -898,6 +898,39 @@ serving might fall into the per-expert host-sync fallback is refuted for this
 stack. The remaining MoE route issue is grouped-MMQ small-M efficiency, not
 fallback dispatch avoidance.
 
+### Phase 35 regular MUL_MAT route trace
+
+Phase 35 added patch `0061`, default-off `LLAMA_MUL_MAT_ROUTE_TRACE=<n>`, to
+classify regular `MUL_MAT` routes for the `bf16-proj` serving bucket. Artifact:
+`/home/mudler/bench/phase35_mul_mat_route_trace/20260701_074359`.
+
+Default-off, trace-enabled, and post-serving gates were all bit-exact: MoE
+`8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
+`806/806`.
+
+Live n128 serving with `LLAMA_MUL_MAT_ROUTE_TRACE=8192` produced:
+
+| route | count |
+|-------|-------|
+| `mat_f` | 2888 |
+| `op_cublas` | 2292 |
+| `mmq` | 1328 |
+| `vec_q` | 1214 |
+| `vec_f` | 470 |
+
+The trace was BF16-heavy (`type=30`: 3965 calls), mostly `mat_f=2485` and
+`op_cublas=1330`. Top BF16 shapes were `mat_f ne1=12` (775),
+`op_cublas ne1=18` (760), and `mat_f ne1=8` (570); `ne12=ne13=1` throughout the
+top shapes, so batched cuBLAS is not the measured target.
+
+Lever implication: the next projection phase should add cuBLAS/MMF subroute
+detail or test a narrow BF16 route policy for the generic `op_cublas` shapes.
+Do not spend time on batched cuBLAS for this n128 serving slice. If MTP is enabled
+in a future serving configuration, first isolate `mtp_eh_proj` / shared-head
+projection with `llama-debug --tensor-filter 'mtp_|h_nextn|nextn|ffn_|attn_'`
+before optimizing ordinary decoder projections.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
 
 ### Phase 10 GDN C32 slab update
diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0061-feat-cuda-trace-mul-mat-routes.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0061-feat-cuda-trace-mul-mat-routes.patch
new file mode 100644
index 000000000..2099568d0
--- /dev/null
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0061-feat-cuda-trace-mul-mat-routes.patch
@@ -0,0 +1,345 @@
+From 486c28c63d5297afd06e5a2bdbd4fb89cad749cd Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Wed, 1 Jul 2026 05:49:12 +0000
+Subject: [PATCH] feat(cuda): trace mul mat routes
+
+Add a default-off LLAMA_MUL_MAT_ROUTE_TRACE diagnostic for regular MUL_MAT dispatch routes.
+
+The trace classifies projection-heavy calls without changing dispatch behavior.
+
+Assisted-by: Codex:gpt-5
+---
+ ggml/src/ggml-cuda/ggml-cuda.cu      |  44 +++++++++-
+ ggml/src/ggml-cuda/mmq-shape-trace.h | 117 +++++++++++++++++++++++++++
+ tests/test-cuda-mmq-shape-trace.cpp  |  85 +++++++++++++++++++
+ 3 files changed, 244 insertions(+), 2 deletions(-)
+
+diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
+index a1754df39..cd34aff13 100644
+--- a/ggml/src/ggml-cuda/ggml-cuda.cu
++++ b/ggml/src/ggml-cuda/ggml-cuda.cu
+@@ -2580,6 +2580,32 @@ static bool ggml_cuda_should_fuse_mul_mat_vec_q(const ggml_tensor * tensor) {
+     return use_mul_mat_vec_q;
+ }
+ 
++static inline int ggml_cuda_mul_mat_route_trace_limit() {
++    static const int value = []() {
++        const char * s = getenv("LLAMA_MUL_MAT_ROUTE_TRACE");
++        return s ? atoi(s) : 0;
++    }();
++
++    return value;
++}
++
++static inline void ggml_cuda_mul_mat_route_trace(const ggml_cuda_mul_mat_route_shape & shape) {
++    const int trace_limit = ggml_cuda_mul_mat_route_trace_limit();
++    if (trace_limit <= 0) {
++        return;
++    }
++
++    static std::atomic<int> trace_count{0};
++    const int trace_idx = trace_count.fetch_add(1, std::memory_order_relaxed);
++    if (trace_idx >= trace_limit) {
++        return;
++    }
++
++    char buf[256];
++    ggml_cuda_mul_mat_route_shape_format(buf, sizeof(buf), shape);
++    fprintf(stderr, "[LLAMA_MUL_MAT_ROUTE] %s\n", buf);
++}
++
+ static void ggml_cuda_mul_mat(ggml_backend_cuda_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
+     const bool split = ggml_backend_buft_is_cuda_split(src0->buffer->buft);
+ 
+@@ -2591,6 +2617,10 @@ static void ggml_cuda_mul_mat(ggml_backend_cuda_context & ctx, const ggml_tensor
+     if (!split) {
+         const int cc_fp4 = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;
+         if (ggml_cuda_fp4_prefill_should_engage(src0, src1, dst, cc_fp4)) {
++            ggml_cuda_mul_mat_route_trace(ggml_cuda_mul_mat_route_shape_make(
++                src0->type, dst->ne[1], src1->ne[1], src1->ne[2], src1->ne[3], split,
++                /*use_vec_f=*/false, /*use_mat_f=*/false, /*use_vec_q=*/false, /*use_mmq=*/false,
++                /*use_batched_cublas=*/false, /*use_fp4_prefill=*/true, /*use_fwht=*/false));
+             ggml_cuda_mul_mat_fp4_large_m(ctx, src0, src1, dst);
+             return;
+         }
+@@ -2654,12 +2684,23 @@ static void ggml_cuda_mul_mat(ggml_backend_cuda_context & ctx, const ggml_tensor
+     bool use_batched_cublas_f16  = src0->type == GGML_TYPE_F16 && (src1->type == GGML_TYPE_F16 || !any_gpus_with_slow_fp16);
+     bool use_batched_cublas_bf16 = src0->type == GGML_TYPE_BF16 && bf16_mma_hardware_available(cc);
+     bool use_batched_cublas_f32  = src0->type == GGML_TYPE_F32;
++    bool use_batched_cublas      = !split && (use_batched_cublas_f16 || use_batched_cublas_bf16 || use_batched_cublas_f32)
++        && !ggml_is_transposed(src0) && !ggml_is_transposed(src1) && src1->ne[2]*src1->ne[3] > 1;
+ 
+     const int32_t hint = ggml_get_op_params_i32(dst, 1);
+     if (hint == GGML_HINT_SRC0_IS_HADAMARD && !split && ggml_cuda_op_fwht(ctx, src1, dst)) {
++        ggml_cuda_mul_mat_route_trace(ggml_cuda_mul_mat_route_shape_make(
++            src0->type, dst->ne[1], src1->ne[1], src1->ne[2], src1->ne[3], split,
++            use_mul_mat_vec_f, use_mul_mat_f, use_mul_mat_vec_q, use_mul_mat_q,
++            use_batched_cublas, /*use_fp4_prefill=*/false, /*use_fwht=*/true));
+         return;
+     }
+ 
++    ggml_cuda_mul_mat_route_trace(ggml_cuda_mul_mat_route_shape_make(
++        src0->type, dst->ne[1], src1->ne[1], src1->ne[2], src1->ne[3], split,
++        use_mul_mat_vec_f, use_mul_mat_f, use_mul_mat_vec_q, use_mul_mat_q,
++        use_batched_cublas, /*use_fp4_prefill=*/false, /*use_fwht=*/false));
++
+     if (!split && use_mul_mat_vec_f) {
+         // the custom F16 vector kernel can be used over batched cuBLAS GEMM
+         // but this is only faster for GPUs without tensor cores or with a thin src0 matrix (particularly KQV in attention)
+@@ -2670,8 +2711,7 @@ static void ggml_cuda_mul_mat(ggml_backend_cuda_context & ctx, const ggml_tensor
+         ggml_cuda_mul_mat_vec_q(ctx, src0, src1, nullptr, dst);
+     } else if (!split && use_mul_mat_q) {
+         ggml_cuda_mul_mat_q(ctx, src0, src1, nullptr, dst);
+-    } else if (!split && (use_batched_cublas_f16 || use_batched_cublas_bf16 || use_batched_cublas_f32)
+-        && !ggml_is_transposed(src0) && !ggml_is_transposed(src1) && src1->ne[2]*src1->ne[3] > 1) {
++    } else if (use_batched_cublas) {
+         // general KQ + KQV multi-batch without FlashAttention
+         ggml_cuda_mul_mat_batched_cublas(ctx, src0, src1, dst);
+     } else if (use_mul_mat_vec_f) {
+diff --git a/ggml/src/ggml-cuda/mmq-shape-trace.h b/ggml/src/ggml-cuda/mmq-shape-trace.h
+index da234a302..8ac373fd9 100644
+--- a/ggml/src/ggml-cuda/mmq-shape-trace.h
++++ b/ggml/src/ggml-cuda/mmq-shape-trace.h
+@@ -71,6 +71,37 @@ struct ggml_cuda_mmid_route_shape {
+     bool host_sync;
+ };
+ 
++enum ggml_cuda_mul_mat_route {
++    GGML_CUDA_MUL_MAT_ROUTE_FP4_PREFILL,
++    GGML_CUDA_MUL_MAT_ROUTE_FWHT,
++    GGML_CUDA_MUL_MAT_ROUTE_VEC_F,
++    GGML_CUDA_MUL_MAT_ROUTE_MAT_F,
++    GGML_CUDA_MUL_MAT_ROUTE_VEC_Q,
++    GGML_CUDA_MUL_MAT_ROUTE_MMQ,
++    GGML_CUDA_MUL_MAT_ROUTE_BATCHED_CUBLAS,
++    GGML_CUDA_MUL_MAT_ROUTE_OP_VEC_F,
++    GGML_CUDA_MUL_MAT_ROUTE_OP_VEC_Q,
++    GGML_CUDA_MUL_MAT_ROUTE_OP_MMQ,
++    GGML_CUDA_MUL_MAT_ROUTE_OP_CUBLAS,
++};
++
++struct ggml_cuda_mul_mat_route_shape {
++    ggml_cuda_mul_mat_route route;
++    int type;
++    int64_t ne1;
++    int64_t ne11;
++    int64_t ne12;
++    int64_t ne13;
++    bool split;
++    bool use_vec_f;
++    bool use_mat_f;
++    bool use_vec_q;
++    bool use_mmq;
++    bool use_batched_cublas;
++    bool use_fp4_prefill;
++    bool use_fwht;
++};
++
+ static inline const char * ggml_cuda_mmid_route_name(const ggml_cuda_mmid_route route) {
+     switch (route) {
+         case GGML_CUDA_MMID_ROUTE_MMVQ:     return "mmvq";
+@@ -83,6 +114,24 @@ static inline const char * ggml_cuda_mmid_route_name(const ggml_cuda_mmid_route
+     return "unknown";
+ }
+ 
++static inline const char * ggml_cuda_mul_mat_route_name(const ggml_cuda_mul_mat_route route) {
++    switch (route) {
++        case GGML_CUDA_MUL_MAT_ROUTE_FP4_PREFILL:    return "fp4_prefill";
++        case GGML_CUDA_MUL_MAT_ROUTE_FWHT:           return "fwht";
++        case GGML_CUDA_MUL_MAT_ROUTE_VEC_F:          return "vec_f";
++        case GGML_CUDA_MUL_MAT_ROUTE_MAT_F:          return "mat_f";
++        case GGML_CUDA_MUL_MAT_ROUTE_VEC_Q:          return "vec_q";
++        case GGML_CUDA_MUL_MAT_ROUTE_MMQ:            return "mmq";
++        case GGML_CUDA_MUL_MAT_ROUTE_BATCHED_CUBLAS: return "batched_cublas";
++        case GGML_CUDA_MUL_MAT_ROUTE_OP_VEC_F:       return "op_vec_f";
++        case GGML_CUDA_MUL_MAT_ROUTE_OP_VEC_Q:       return "op_vec_q";
++        case GGML_CUDA_MUL_MAT_ROUTE_OP_MMQ:         return "op_mmq";
++        case GGML_CUDA_MUL_MAT_ROUTE_OP_CUBLAS:      return "op_cublas";
++    }
++
++    return "unknown";
++}
++
+ static inline ggml_cuda_mmq_shape ggml_cuda_mmq_shape_make(
+         const int type, const bool is_moe, const int64_t ncols_dst, const int64_t nchannels_x,
+         const int64_t ncols_max, const int mmq_x_max, const int mmq_x_lim, const int mmq_x_best,
+@@ -110,6 +159,52 @@ static inline ggml_cuda_mmq_shape ggml_cuda_mmq_shape_make(
+     };
+ }
+ 
++static inline ggml_cuda_mul_mat_route_shape ggml_cuda_mul_mat_route_shape_make(
++        const int type, const int64_t ne1, const int64_t ne11, const int64_t ne12, const int64_t ne13,
++        const bool split, const bool use_vec_f, const bool use_mat_f, const bool use_vec_q,
++        const bool use_mmq, const bool use_batched_cublas, const bool use_fp4_prefill,
++        const bool use_fwht) {
++    ggml_cuda_mul_mat_route route = GGML_CUDA_MUL_MAT_ROUTE_OP_CUBLAS;
++    if (use_fp4_prefill) {
++        route = GGML_CUDA_MUL_MAT_ROUTE_FP4_PREFILL;
++    } else if (use_fwht) {
++        route = GGML_CUDA_MUL_MAT_ROUTE_FWHT;
++    } else if (!split && use_vec_f) {
++        route = GGML_CUDA_MUL_MAT_ROUTE_VEC_F;
++    } else if (!split && use_mat_f) {
++        route = GGML_CUDA_MUL_MAT_ROUTE_MAT_F;
++    } else if (!split && use_vec_q) {
++        route = GGML_CUDA_MUL_MAT_ROUTE_VEC_Q;
++    } else if (!split && use_mmq) {
++        route = GGML_CUDA_MUL_MAT_ROUTE_MMQ;
++    } else if (!split && use_batched_cublas) {
++        route = GGML_CUDA_MUL_MAT_ROUTE_BATCHED_CUBLAS;
++    } else if (use_vec_f) {
++        route = GGML_CUDA_MUL_MAT_ROUTE_OP_VEC_F;
++    } else if (use_vec_q) {
++        route = GGML_CUDA_MUL_MAT_ROUTE_OP_VEC_Q;
++    } else if (use_mmq) {
++        route = GGML_CUDA_MUL_MAT_ROUTE_OP_MMQ;
++    }
++
++    return {
++        route,
++        type,
++        ne1,
++        ne11,
++        ne12,
++        ne13,
++        split,
++        use_vec_f,
++        use_mat_f,
++        use_vec_q,
++        use_mmq,
++        use_batched_cublas,
++        use_fp4_prefill,
++        use_fwht,
++    };
++}
++
+ static inline ggml_cuda_mmid_route_shape ggml_cuda_mmid_route_shape_make(
+         const int type, const int64_t ne2, const int64_t ne12, const int64_t n_experts,
+         const int mmvq_max, const bool use_mmq, const bool use_mmf, const bool is_amd,
+@@ -260,6 +355,28 @@ static inline int ggml_cuda_mmid_route_shape_format(
+         shape.is_quantized ? 1 : 0);
+ }
+ 
++static inline int ggml_cuda_mul_mat_route_shape_format(
++        char * buf, const size_t size, const ggml_cuda_mul_mat_route_shape & shape) {
++    return std::snprintf(buf, size,
++        "route=%s type=%d ne1=%lld ne11=%lld ne12=%lld ne13=%lld split=%d "
++        "use_vec_f=%d use_mat_f=%d use_vec_q=%d use_mmq=%d use_batched_cublas=%d "
++        "use_fp4_prefill=%d use_fwht=%d",
++        ggml_cuda_mul_mat_route_name(shape.route),
++        shape.type,
++        (long long) shape.ne1,
++        (long long) shape.ne11,
++        (long long) shape.ne12,
++        (long long) shape.ne13,
++        shape.split ? 1 : 0,
++        shape.use_vec_f ? 1 : 0,
++        shape.use_mat_f ? 1 : 0,
++        shape.use_vec_q ? 1 : 0,
++        shape.use_mmq ? 1 : 0,
++        shape.use_batched_cublas ? 1 : 0,
++        shape.use_fp4_prefill ? 1 : 0,
++        shape.use_fwht ? 1 : 0);
++}
++
+ static inline int ggml_cuda_mmq_small_m_shape_format(
+         char * buf, const size_t size, const ggml_cuda_mmq_small_m_shape & shape) {
+     return std::snprintf(buf, size,
+diff --git a/tests/test-cuda-mmq-shape-trace.cpp b/tests/test-cuda-mmq-shape-trace.cpp
+index e190cf1ac..2bd41d1d8 100644
+--- a/tests/test-cuda-mmq-shape-trace.cpp
++++ b/tests/test-cuda-mmq-shape-trace.cpp
+@@ -200,5 +200,90 @@ int main() {
+     require(std::strstr(buf, "use_mmq=1") != nullptr, "MMID trace includes MMQ predicate");
+     require(std::strstr(buf, "use_mmf=1") != nullptr, "MMID trace includes MMF predicate");
+ 
++    const ggml_cuda_mul_mat_route_shape mat_f = ggml_cuda_mul_mat_route_shape_make(
++        /* type */ 30,
++        /* ne1 */ 128,
++        /* ne11 */ 128,
++        /* ne12 */ 1,
++        /* ne13 */ 1,
++        /* split */ false,
++        /* use_vec_f */ false,
++        /* use_mat_f */ true,
++        /* use_vec_q */ false,
++        /* use_mmq */ false,
++        /* use_batched_cublas */ false,
++        /* use_fp4_prefill */ false,
++        /* use_fwht */ false);
++
++    require(mat_f.route == GGML_CUDA_MUL_MAT_ROUTE_MAT_F, "regular MUL_MAT prefers direct mat_f when available");
++    require(!mat_f.split, "regular MUL_MAT trace records split flag");
++
++    const ggml_cuda_mul_mat_route_shape batched = ggml_cuda_mul_mat_route_shape_make(
++        /* type */ 31,
++        /* ne1 */ 128,
++        /* ne11 */ 128,
++        /* ne12 */ 4,
++        /* ne13 */ 1,
++        /* split */ false,
++        /* use_vec_f */ false,
++        /* use_mat_f */ false,
++        /* use_vec_q */ false,
++        /* use_mmq */ false,
++        /* use_batched_cublas */ true,
++        /* use_fp4_prefill */ false,
++        /* use_fwht */ false);
++
++    require(batched.route == GGML_CUDA_MUL_MAT_ROUTE_BATCHED_CUBLAS,
++        "regular MUL_MAT records batched cuBLAS route");
++
++    const ggml_cuda_mul_mat_route_shape op_cublas = ggml_cuda_mul_mat_route_shape_make(
++        /* type */ 0,
++        /* ne1 */ 16,
++        /* ne11 */ 16,
++        /* ne12 */ 1,
++        /* ne13 */ 1,
++        /* split */ true,
++        /* use_vec_f */ false,
++        /* use_mat_f */ false,
++        /* use_vec_q */ false,
++        /* use_mmq */ false,
++        /* use_batched_cublas */ false,
++        /* use_fp4_prefill */ false,
++        /* use_fwht */ false);
++
++    require(op_cublas.route == GGML_CUDA_MUL_MAT_ROUTE_OP_CUBLAS,
++        "regular MUL_MAT fallback records op cublas route");
++
++    const ggml_cuda_mul_mat_route_shape fp4_prefill = ggml_cuda_mul_mat_route_shape_make(
++        /* type */ 39,
++        /* ne1 */ 512,
++        /* ne11 */ 512,
++        /* ne12 */ 1,
++        /* ne13 */ 1,
++        /* split */ false,
++        /* use_vec_f */ false,
++        /* use_mat_f */ false,
++        /* use_vec_q */ false,
++        /* use_mmq */ true,
++        /* use_batched_cublas */ false,
++        /* use_fp4_prefill */ true,
++        /* use_fwht */ false);
++
++    require(fp4_prefill.route == GGML_CUDA_MUL_MAT_ROUTE_FP4_PREFILL,
++        "regular MUL_MAT records native FP4 prefill route before MMQ");
++
++    const int mul_mat_route_n = ggml_cuda_mul_mat_route_shape_format(buf, sizeof(buf), mat_f);
++
++    require(mul_mat_route_n > 0, "regular MUL_MAT route format returns byte count");
++    require(std::strstr(buf, "route=mat_f") != nullptr, "regular MUL_MAT trace includes route name");
++    require(std::strstr(buf, "type=30") != nullptr, "regular MUL_MAT trace includes type");
++    require(std::strstr(buf, "ne1=128") != nullptr, "regular MUL_MAT trace includes output columns");
++    require(std::strstr(buf, "ne11=128") != nullptr, "regular MUL_MAT trace includes src1 columns");
++    require(std::strstr(buf, "ne12=1") != nullptr, "regular MUL_MAT trace includes src1 batch dim");
++    require(std::strstr(buf, "split=0") != nullptr, "regular MUL_MAT trace includes split flag");
++    require(std::strstr(buf, "use_mat_f=1") != nullptr, "regular MUL_MAT trace includes mat_f predicate");
++    require(std::strstr(buf, "use_batched_cublas=0") != nullptr,
++        "regular MUL_MAT trace includes batched cuBLAS predicate");
++
+     return 0;
+ }
+-- 
+2.43.0
+
diff --git a/docs/superpowers/plans/2026-07-01-mul-mat-route-trace-phase35.md b/docs/superpowers/plans/2026-07-01-mul-mat-route-trace-phase35.md
new file mode 100644
index 000000000..6f0adeb4a
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-mul-mat-route-trace-phase35.md
@@ -0,0 +1,54 @@
+# Phase 35: Regular MUL_MAT Route Trace
+
+**Goal:** Split the projection-heavy regular `MUL_MAT` serving bucket into concrete dispatch routes before attempting a projection optimization.
+
+**Scope:** llama.cpp fork first, then LocalAI patch `0061`. Instrumentation only; no route or numeric behavior change.
+
+## Plan
+
+- [x] Inspect regular `ggml_cuda_mul_mat` dispatch order and projection bucket docs.
+- [x] Dispatch sidecar explorers for llama.cpp route context and vLLM projection context.
+- [x] Add failing host-only tests for regular `MUL_MAT` route classification.
+- [x] Implement `ggml_cuda_mul_mat_route_shape_make()` and formatter.
+- [x] Wire default-off `LLAMA_MUL_MAT_ROUTE_TRACE=<n>` around regular `MUL_MAT`.
+- [x] Build and run `test-cuda-mmq-shape-trace` locally.
+- [x] Build `llama-server`, `llama-completion`, `test-backend-ops`, and `test-cuda-mmq-shape-trace` on DGX.
+- [x] Run default-off and trace-enabled md5/op gates.
+- [x] Run n128 serving route trace and parse route/type/shape counts.
+- [x] Run post-serving md5/op gates.
+- [x] Commit fork and DGX mirror, export LocalAI patch `0061`.
+- [x] Update README, parity docs, handoff, and patch maintenance.
+- [x] Re-run strict patch-series mirror invariant.
+
+## Results
+
+Artifact: `/home/mudler/bench/phase35_mul_mat_route_trace/20260701_074359`.
+
+Commits:
+
+- Fork: `486c28c63 feat(cuda): trace mul mat routes`
+- DGX mirror: `18f7ad005 feat(cuda): trace mul mat routes`
+- LocalAI patch: `backend/cpp/llama-cpp-localai-paged/patches/paged/0061-feat-cuda-trace-mul-mat-routes.patch`
+
+Gates:
+
+- Default-off MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
+- Default-off dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
+- Trace-enabled MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
+- Trace-enabled dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
+- Post-serving MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
+- Post-serving dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
+- `MUL_MAT`: `1146/1146` in default, trace-enabled, and post-serving gates
+- `MUL_MAT_ID`: `806/806` in default, trace-enabled, and post-serving gates
+
+n128 route trace:
+
+- `mat_f`: 2888
+- `op_cublas`: 2292
+- `mmq`: 1328
+- `vec_q`: 1214
+- `vec_f`: 470
+
+BF16 (`type=30`) was the largest traced type: 3965 calls, split into `mat_f=2485` and `op_cublas=1330`. Top BF16 shapes were `mat_f ne1=12` 775, `op_cublas ne1=18` 760, and `mat_f ne1=8` 570.
+
+Decision: next projection work should add cuBLAS/MMF subroute detail or test a narrow BF16 route policy for generic `op_cublas` shapes. Do not spend effort on batched cuBLAS for this measured n128 serving slice.