feat(paged): add cublas tensor-name trace patch

Add patch 0063 extending LLAMA_CUBLAS_ROUTE_TRACE with src0/src1/dst tensor names. Record the Phase 37 gates and the conclusion that the SGEMM bucket traces to MoE gate tensors.

Assisted-by: Codex:gpt-5
2026-07-02 20:37:03 -04:00 · 2026-07-01 06:41:00 +00:00
parent fbdc200886
commit 9f75da01f9
7 changed files with 330 additions and 10 deletions
--- a/backend/cpp/llama-cpp-localai-paged/README.md
+++ b/backend/cpp/llama-cpp-localai-paged/README.md
@@ -87,7 +87,7 @@ orthogonal to the paged allocator.

 ---

-## 3. Patch series (0001-0062)
+## 3. Patch series (0001-0063)

 Source-only patches, with intentional numbering gaps (e.g. 0005, 0027). The
 decode-serving graph-reuse levers are 0040-0041. "Bit-exact" = greedy md5 /
@@ -220,6 +220,7 @@ These are the dominant decode levers on the Qwen3.6 hybrid models. All bit-exact
 | 0060 | **Trace MoE MMID dispatch routes** - adds default-off `LLAMA_MOE_MMID_ROUTE_TRACE=<n>` around `MUL_MAT_ID` dispatch, classifying each call as `mmvq`, `mmvf`, grouped `mmq`, `mmf`, or host-sync `fallback`. This is evidence-only instrumentation to resolve whether serving hits the per-expert host-sync fallback. | yes (default-off, trace-enabled, and post-serving gates green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`; Phase 34 n128 trace found `mmq=2776`, `mmvq=1320`, `host_sync=0/4096`) |
 | 0061 | **Trace regular MUL_MAT dispatch routes** - adds default-off `LLAMA_MUL_MAT_ROUTE_TRACE=<n>` around regular `MUL_MAT`, classifying projection-heavy calls as `vec_f`, `mat_f`, `vec_q`, `mmq`, `batched_cublas`, `op_*`, `fp4_prefill`, or `fwht`. This is evidence-only instrumentation for the `bf16-proj` serving bucket. | yes (default-off, trace-enabled, and post-serving gates green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` `806/806`; Phase 35 n128 trace found BF16 routes `mat_f=2485`, `op_cublas=1330`) |
 | 0062 | **Trace cuBLAS subroutes** - adds default-off `LLAMA_CUBLAS_ROUTE_TRACE=<n>` around the generic cuBLAS `MUL_MAT` path, classifying calls as `nvfp4_bf16_tc`, `bf16_tc`, `f16_tc_32f`, `f16_tc_16f`, or `sgemm`. This is evidence-only instrumentation for the Phase 35 `op_cublas` bucket. | yes (default-off, trace-enabled, and post-serving gates green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` `806/806`; Phase 36 n128 trace found `bf16_tc=5681`, `sgemm=2511`) |
+| 0063 | **Trace cuBLAS tensor names** - extends `LLAMA_CUBLAS_ROUTE_TRACE=<n>` with `src0`, `src1`, and `dst` names so the `sgemm` bucket can be tied back to graph nodes. | yes (default-off, trace-enabled, and post-serving gates green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` `806/806`; Phase 37 n128 trace identified `sgemm` as `ffn_gate_inp* -> ffn_moe_logits/shared_expert_gate`) |

 > **Dropped: patch 0026 (hybrid per-head bf16 SSM state, `ssm_bf16_tau`).** Once
 > the decode fusions (0028 recurrent-state gather-fusion + 0029 block-table cache)
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -2160,3 +2160,57 @@ Decision:
 - The next projection phase should identify whether the `type=0` SGEMM shapes
  are expected glue tensors or a missed BF16 route. Do not change routing until
  a separately gated policy proves md5/op safety.
+
+## Phase 37 cuBLAS Tensor-Name Trace
+
+Phase 37 added patch `0063`, extending the default-off
+`LLAMA_CUBLAS_ROUTE_TRACE=<n>` diagnostic with `src0`, `src1`, and `dst` tensor
+names. It is instrumentation only.
+
+Artifact:
+
+- `/home/mudler/bench/phase37_cublas_name_trace/20260701_083227`
+
+Run:
+
+- Fork commit: `/home/mudler/_git/llama.cpp` `2d590d770`
+- DGX mirror commit: `dgx:~/llama-phase6-source` `2cbb61969`
+- Env: `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 LLAMA_CUBLAS_ROUTE_TRACE=4096`
+- Workload: staggered n128 `llama-server` diagnostic trace
+
+Route summary:
+
+| route | count |
+|-------|------:|
+| `bf16_tc` | 2884 |
+| `sgemm` | 1212 |
+
+Named bucket summary:
+
+| route | tensor pattern |
+|-------|----------------|
+| `bf16_tc` | `blk.N.attn_gate.weight -> z-N` |
+| `bf16_tc` | `blk.N.ssm_out.weight -> linear_attn_out-N` |
+| `sgemm` | `blk.N.ffn_gate_inp.weight -> ffn_moe_logits-N` |
+| `sgemm` | `blk.N.ffn_gate_inp_shexp.weight -> shared_expert_gate-N` |
+
+Gates:
+
+| check | status | actual |
+|-------|--------|--------|
+| default-off MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
+| default-off dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
+| trace-enabled MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
+| trace-enabled dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
+| post-serving MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
+| post-serving dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
+| `MUL_MAT` | ok | `1146/1146` default, trace, post-serving |
+| `MUL_MAT_ID` | ok | `806/806` default, trace, post-serving |
+
+Decision:
+
+- The Phase 36 F32 SGEMM bucket is mainly MoE gate logits and shared-expert gate
+  projections, not an anonymous missed dense projection route.
+- Do not blindly force these calls to BF16. First inspect the model-load tensor
+  types for `ffn_gate_inp*`; if changing weight dtype or graph routing is
+  considered, require md5/op gates and KL validation.
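The named bucket summary above can be reproduced from the raw stderr trace. The following is a minimal sketch, not the tooling used for Phase 37: it assumes the line layout emitted by patch `0063` (`[LLAMA_CUBLAS_ROUTE] route=... src0=... src1=... dst=...`), and the helper names `trace_field`, `normalize_layer`, and `tally_buckets` are hypothetical. Layer indices are collapsed to `N` so that per-layer entries fall into one bucket.

```cpp
#include <cctype>
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Returns the value of `key=` in a trace line, reading up to `stop`
// (or to the end of the line when `stop` is empty). The leading space in the
// needle keeps "src0=" from matching inside "src0_contig=".
static std::string trace_field(const std::string & line, const std::string & key,
                               const std::string & stop) {
    const std::string needle = " " + key + "=";
    const std::size_t pos = line.find(needle);
    if (pos == std::string::npos) {
        return "";
    }
    const std::size_t begin = pos + needle.size();
    const std::size_t end   = stop.empty() ? std::string::npos : line.find(stop, begin);
    return line.substr(begin, end == std::string::npos ? std::string::npos : end - begin);
}

// Collapses layer indices: "blk.12.x" -> "blk.N.x" and "name-12" -> "name-N".
static std::string normalize_layer(std::string name) {
    if (name.rfind("blk.", 0) == 0) {
        std::size_t end = 4;
        while (end < name.size() && std::isdigit(static_cast<unsigned char>(name[end]))) {
            ++end;
        }
        if (end > 4) {
            name.replace(4, end - 4, "N");
        }
    }
    const std::size_t dash = name.rfind('-');
    if (dash != std::string::npos && dash + 1 < name.size()) {
        bool all_digits = true;
        for (std::size_t i = dash + 1; i < name.size(); ++i) {
            all_digits = all_digits && std::isdigit(static_cast<unsigned char>(name[i]));
        }
        if (all_digits) {
            name.replace(dash + 1, std::string::npos, "N");
        }
    }
    return name;
}

// Counts trace lines per "route src0 -> dst" bucket; other log lines are skipped.
static std::map<std::string, int> tally_buckets(std::istream & in) {
    std::map<std::string, int> counts;
    std::string line;
    while (std::getline(in, line)) {
        if (line.find("[LLAMA_CUBLAS_ROUTE]") == std::string::npos) {
            continue;
        }
        const std::string route = trace_field(line, "route", " ");
        const std::string src0  = trace_field(line, "src0", " src1=");
        const std::string dst   = trace_field(line, "dst", "");
        ++counts[route + " " + normalize_layer(src0) + " -> " + normalize_layer(dst)];
    }
    return counts;
}
```

Feeding the captured server stderr through `tally_buckets` (for example via a `std::ifstream`) yields one counter per pattern such as `sgemm blk.N.ffn_gate_inp.weight -> ffn_moe_logits-N`. Delimiting `src0` by the following ` src1=` label rather than by whitespace is a deliberate choice, since tensor names are not guaranteed to be free of spaces.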
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -478,6 +478,18 @@ MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
 shapes are expected glue tensors or a missed BF16 route; do not chase NVFP4
 cuBLAS or batched cuBLAS for this measured bucket.

+Phase 37 added patch `0063`, extending `LLAMA_CUBLAS_ROUTE_TRACE=<n>` with
+`src0`, `src1`, and `dst` tensor names. Artifact:
+`/home/mudler/bench/phase37_cublas_name_trace/20260701_083227`. Fork commit:
+`2d590d770 feat(cuda): trace cublas tensor names`; DGX mirror commit:
+`2cbb61969`. Default-off, trace-enabled, and post-serving gates stayed green:
+MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID`
+`806/806`. Live n128 trace found `bf16_tc=2884`, `sgemm=1212`. The `sgemm`
+bucket is `blk.N.ffn_gate_inp.weight -> ffn_moe_logits-N` and
+`blk.N.ffn_gate_inp_shexp.weight -> shared_expert_gate-N`; do not force BF16
+without first inspecting model-load tensor types and running KL validation.
+
 ---

 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -527,15 +539,15 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 ## 7. KEY FILE / ARTIFACT INDEX

 ### Fork (canonical source of truth)
- Local canonical fork: `/home/mudler/_git/llama.cpp`, branch **`localai-paged`**, HEAD `38c4ef2e4` ("trace cublas routes", patch `0062`).
- DGX current clean mirror/build tree: `dgx:~/llama-phase6-source`, HEAD `e0224393a` with the Phase 36 cuBLAS route-trace patch applied and committed; Phase 20/26/27 artifacts still record their historical source hashes.
+- Local canonical fork: `/home/mudler/_git/llama.cpp`, branch **`localai-paged`**, HEAD `2d590d770` ("trace cublas tensor names", patch `0063`).
+- DGX current clean mirror/build tree: `dgx:~/llama-phase6-source`, HEAD `2cbb61969` with the Phase 37 cuBLAS tensor-name trace patch applied and committed; Phase 20/26/27 artifacts still record their historical source hashes.
 - Historical DGX dev tree: `dgx:~/llama-paged-dev`, branch **`paged`**, HEAD `a7d439e8ce6990eb09721223c975da4e49d8d136` ("GDN CONFIG C (M8) - bf16 Kc/Qc"). It is an old experimental tree and must not be treated as canonical.

 ### LocalAI worktree
 - Path: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention`, branch `worktree-feat+paged-attention` (currently 246 ahead, 31 behind `origin/master`; recompute before reporting).
 - Backend dir: `backend/cpp/llama-cpp-localai-paged/` (`Makefile` thin wrapper, `package.sh`, `run.sh`, `README.md` ~44 KB canonical, `docs/`, `patches/paged/`).
 - `docs/`: `VLLM_PARITY_FINAL.md` (authoritative record), `VLLM_PARITY_LEVER_MAP.md` (working brainstorm, profile-validated section), `DECODE_SERVING_SCOPE.md`, `PREFILL_GEMM_SCOPE.md`, `PREFILL_GEMM_RESULTS.md`, `TENSORCORE_GDN_SCOPE.md`, `TENSORCORE_GDN_BUILD_PLAN.md`, `ACCELERATOR_PORTING_SCOPE.md`, `UPSTREAM_LAYER2_SCOPE.md`, `LOCALAI_LLAMACPP_BACKEND_PLAN.md`, `PAGED_BITEXACT_NOTE.md`, `PATCH_MAINTENANCE.md`, `final_benchmark.csv`, `paged-burst-bench.cpp`, `paged-reclaim-unit.cpp`, 3 PNGs, and this `PARITY_HANDOFF.md`.
- `patches/paged/`: **53** `.patch` files spanning 0001-0062 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata/shape/padding is 0048-0050; MoE safety tests are 0051-0053; MTP backend-sampling safety is 0054; speculative shape trace is 0055; MoE MMQ selector/launch/candidate/tile-policy/route instrumentation is 0056-0060; regular MUL_MAT route instrumentation is 0061; cuBLAS route instrumentation is 0062.
+- `patches/paged/`: **54** `.patch` files spanning 0001-0063 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata/shape/padding is 0048-0050; MoE safety tests are 0051-0053; MTP backend-sampling safety is 0054; speculative shape trace is 0055; MoE MMQ selector/launch/candidate/tile-policy/route instrumentation is 0056-0060; regular MUL_MAT route instrumentation is 0061; cuBLAS route instrumentation is 0062-0063.

 ### Bench artifacts (DGX)
 - `~/bench/COMBINED_DEFINITIVE.txt` (+ `.log`, `.done`, `combined_definitive.sh`, `combined_definitive.out`) - historical same-session both-engine run.
@@ -554,6 +566,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - `~/bench/phase34_mmid_route_trace/20260701_072737` - default-off MoE MMID route trace patch `0060`; default/trace/post-serving md5 gates green; n128 route trace found `mmq=2776`, `mmvq=1320`, `host_sync=0/4096`.
 - `~/bench/phase35_mul_mat_route_trace/20260701_074359` - default-off regular MUL_MAT route trace patch `0061`; default/trace/post-serving md5 gates green; n128 route trace found BF16 `mat_f=2485`, `op_cublas=1330`.
 - `~/bench/phase36_cublas_route_trace/20260701_081228` - default-off cuBLAS subroute trace patch `0062`; default/trace/post-serving md5 and op gates green; n128 route trace found `bf16_tc=5681`, `sgemm=2511`.
+- `~/bench/phase37_cublas_name_trace/20260701_083227` - cuBLAS tensor-name trace patch `0063`; default/trace/post-serving md5 and op gates green; n128 trace identified `sgemm` as MoE gate logits and shared-expert gate projections.
 - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
@@ -566,8 +579,8 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual

 ### Discrepancies to flag / resolve (carried verbatim from the gather, including UNVERIFIED labels)
 1. **Pin prose reconciled in this worktree.** Makefile line 52 `LLAMA_VERSION?=0ed235ea2c17a19fc8238668653946721ed136fd` is authoritative and matches the local fork merge-base. Hard rule: the paged pin must equal the stock `llama-cpp` pin (shared `grpc-server.cpp`); a bump to `c299a92c` once broke the grpc-server link despite being bit-exact and was reverted. Trust the Makefile when building.
-2. **Current fork/mirror are clean and verified.** Local fork HEAD is `38c4ef2e4`, DGX clean mirror HEAD is `e0224393a`, and Phase 36 should be treated as the current patch-series tip. The old `llama-paged-dev` tree is historical only.
-3. **Worktree patch series is tracked through 0062.** The only expected unrelated untracked path in this worktree is `.claude/`.
+2. **Current fork/mirror are clean and verified.** Local fork HEAD is `2d590d770`, DGX clean mirror HEAD is `2cbb61969`, and Phase 37 should be treated as the current patch-series tip. The old `llama-paged-dev` tree is historical only.
+3. **Worktree patch series is tracked through 0063.** The only expected unrelated untracked path in this worktree is `.claude/`.
 4. **`sm_121a` is not in the worktree build files** - it lives only in the DGX experimental build scripts (`gdn_cc.sh`, `gdn_bv_build.sh`, `paged-build.sh`); mainline uses arch `121`. **UNVERIFIED** whether the shipped CI Dockerfile build path injects `121a` for the FP4-MMA kernels (`Dockerfile.llama-cpp-localai-paged` does not hardcode a CUDA arch).
 5. **The `0921716...` paged-MoE md5 open item.** `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=0921716cd0582b5d15af8c362b811d00` for MoE, but a full doc/patch/`git log -S` grep of the worktree found **no** occurrence of `0921716...` in any committed source; the committed canonical paged-MoE gate is `8cb0ce23`. Treat this as **unreconciled**: the documented, KL-validated paged-MoE gate remains `8cb0ce23`, and any paged-MoE divergence (including `0921716`) must be KL-validated against the f16 reference before being accepted as benign, never on assertion alone. The `0921716` value is **UNVERIFIED** as a sanctioned gate; do not adopt it as canonical without re-running the KL gate. The **dense** run is symmetric: `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=ecfe924dee6c5622c149f419ff2a6481` for dense, which likewise differs from the canonical dense gate `5951a5b4`. Both CDEF `PAGED_GATE_MD5` values come from the `combined_definitive.sh` harness's own gate command, NOT the canonical bit-exact gate command in section 3.3, which is why they diverge from the committed `8cb0ce23` / `5951a5b4`; neither is a sanctioned gate and both must be KL-validated before being treated as benign.

--- a/backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md
@@ -57,18 +57,18 @@ everywhere without ever touching the stock `llama-cpp` source tree.

 ## Latest mirror check

-Phase 36 re-verified the mirror invariant after adding patch `0062`:
+Phase 37 re-verified the mirror invariant after adding patch `0063`:

 ```text
 base=0ed235ea2c17a19fc8238668653946721ed136fd
-applied_tree=208189d119efe27477f1900cc6f7428bd1720449
-fork_tree=208189d119efe27477f1900cc6f7428bd1720449
+applied_tree=dedb1182910eafe9f6875588dc8285bfb544cce5
+fork_tree=dedb1182910eafe9f6875588dc8285bfb544cce5
 ```

 The check used a fresh worktree at `LLAMA_VERSION`, applied every
 `patches/paged/0*.patch` with strict `git apply`, staged the result, and compared
 `git write-tree` to canonical fork branch `localai-paged` at
-`38c4ef2e4 feat(cuda): trace cublas routes`.
+`2d590d770 feat(cuda): trace cublas tensor names`.

 ## Status

--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -955,6 +955,27 @@ tensor-core plus F32 SGEMM, not NVFP4 cuBLAS and not batched cuBLAS. The next
 projection phase should explain whether the F32 SGEMM shapes are expected glue
 tensors or a missed BF16 route, with md5/op gates before any route policy A/B.

+### Phase 37 cuBLAS tensor-name trace
+
+Phase 37 added patch `0063`, extending `LLAMA_CUBLAS_ROUTE_TRACE=<n>` with
+`src0`, `src1`, and `dst` names. Artifact:
+`/home/mudler/bench/phase37_cublas_name_trace/20260701_083227`.
+
+Default-off, trace-enabled, and post-serving gates stayed bit-exact: MoE
+`8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
+`806/806`.
+
+Live n128 serving with trace cap 4096 found `bf16_tc=2884`, `sgemm=1212`.
+The `sgemm type=0` entries are the MoE gate logits and shared-expert gate
+projections: `blk.N.ffn_gate_inp.weight -> ffn_moe_logits-N` and
+`blk.N.ffn_gate_inp_shexp.weight -> shared_expert_gate-N`. Attention and SSM
+projections in the sample are already `bf16_tc`.
+
+Lever implication: do not blindly force the `sgemm` bucket to BF16. First inspect
+why `ffn_gate_inp*` loads as F32 and whether a dtype or graph-route change is
+precision-safe. If attempted, use md5/op gates plus KL validation.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

 ### Phase 10 GDN C32 slab update
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0063-feat-cuda-trace-cublas-tensor-names.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0063-feat-cuda-trace-cublas-tensor-names.patch
@@ -0,0 +1,162 @@
+From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Wed, 1 Jul 2026 06:38:11 +0000
+Subject: [PATCH] feat(cuda): trace cublas tensor names
+
+Extend LLAMA_CUBLAS_ROUTE_TRACE with src0/src1/dst tensor names so SGEMM and BF16 cuBLAS buckets can be tied back to graph nodes.
+
+Assisted-by: Codex:gpt-5
+---
+ ggml/src/ggml-cuda/ggml-cuda.cu      |  5 +++--
+ ggml/src/ggml-cuda/mmq-shape-trace.h | 16 +++++++++++++---
+ tests/test-cuda-mmq-shape-trace.cpp  | 18 ++++++++++++------
+ 3 files changed, 28 insertions(+), 11 deletions(-)
+
+diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
+index eff197818..1da67d2af 100644
+--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
+@@ -1648,7 +1648,7 @@ static inline void ggml_cuda_cublas_route_trace(const ggml_cuda_cublas_route_sha
+         return;
+     }
+ 
+-    char buf[320];
++    char buf[512];
+     ggml_cuda_cublas_route_shape_format(buf, sizeof(buf), shape);
+     fprintf(stderr, "[LLAMA_CUBLAS_ROUTE] %s\n", buf);
+ }
+@@ -1701,7 +1701,8 @@ static void ggml_cuda_op_mul_mat_cublas(
+     ggml_cuda_cublas_route_trace(ggml_cuda_cublas_route_shape_make(
+         src0->type, src1->type, row_diff, src1_ncols, ne00, ne10, ldc,
+         supports_bf16, use_fp16, fast_fp16, force_fp32, force_fp16, src0_contig, full_rows,
+-        GGML_CUDA_CC_IS_CDNA(cc), GGML_CUDA_CC_IS_RDNA4(cc), cc == GGML_CUDA_CC_VOLTA));
++        GGML_CUDA_CC_IS_CDNA(cc), GGML_CUDA_CC_IS_RDNA4(cc), cc == GGML_CUDA_CC_VOLTA,
++        src0->name, src1->name, dst->name));
+ 
+     if (supports_bf16 && src0->type == GGML_TYPE_NVFP4 && src0_contig && full_rows) {
+         // Paged prefill lever (patch 0033): NVFP4 only reaches cuBLAS when
+diff --git a/ggml/src/ggml-cuda/mmq-shape-trace.h b/ggml/src/ggml-cuda/mmq-shape-trace.h
+index f5b4ecf2c..b55c7467c 100644
+--- a/ggml/src/ggml-cuda/mmq-shape-trace.h
+++ b/ggml/src/ggml-cuda/mmq-shape-trace.h
+@@ -129,6 +129,9 @@ struct ggml_cuda_cublas_route_shape {
+     bool is_cdna;
+     bool is_rdna4;
+     bool is_volta;
++    const char * src0_name;
++    const char * src1_name;
++    const char * dst_name;
+ };
+ 
+ static inline const char * ggml_cuda_mmid_route_name(const ggml_cuda_mmid_route route) {
+@@ -251,7 +254,7 @@ static inline ggml_cuda_cublas_route_shape ggml_cuda_cublas_route_shape_make(
+         const int64_t ne00, const int64_t ne10, const int64_t ldc, const bool supports_bf16,
+         const bool use_fp16, const bool fast_fp16, const bool force_fp32, const bool force_fp16,
+         const bool src0_contig, const bool full_rows, const bool is_cdna, const bool is_rdna4,
+-        const bool is_volta) {
++        const bool is_volta, const char * src0_name, const char * src1_name, const char * dst_name) {
+     ggml_cuda_cublas_route route = GGML_CUDA_CUBLAS_ROUTE_SGEMM;
+     if (supports_bf16 && type == 40 && src0_contig && full_rows) {
+         route = GGML_CUDA_CUBLAS_ROUTE_NVFP4_BF16_TC;
+@@ -284,6 +287,9 @@ static inline ggml_cuda_cublas_route_shape ggml_cuda_cublas_route_shape_make(
+         is_cdna,
+         is_rdna4,
+         is_volta,
++        src0_name ? src0_name : "",
++        src1_name ? src1_name : "",
++        dst_name ? dst_name : "",
+     };
+ }
+ 
+@@ -464,7 +470,8 @@ static inline int ggml_cuda_cublas_route_shape_format(
+     return std::snprintf(buf, size,
+         "route=%s type=%d src1_type=%d row_diff=%lld src1_ncols=%lld ne00=%lld ne10=%lld ldc=%lld "
+         "supports_bf16=%d use_fp16=%d fast_fp16=%d force_fp32=%d force_fp16=%d "
+-        "src0_contig=%d full_rows=%d is_cdna=%d is_rdna4=%d is_volta=%d",
++        "src0_contig=%d full_rows=%d is_cdna=%d is_rdna4=%d is_volta=%d "
++        "src0=%s src1=%s dst=%s",
+         ggml_cuda_cublas_route_name(shape.route),
+         shape.type,
+         shape.src1_type,
+@@ -482,7 +489,10 @@ static inline int ggml_cuda_cublas_route_shape_format(
+         shape.full_rows ? 1 : 0,
+         shape.is_cdna ? 1 : 0,
+         shape.is_rdna4 ? 1 : 0,
+-        shape.is_volta ? 1 : 0);
++        shape.is_volta ? 1 : 0,
++        shape.src0_name,
++        shape.src1_name,
++        shape.dst_name);
+ }
+ 
+ static inline int ggml_cuda_mmq_small_m_shape_format(
+diff --git a/tests/test-cuda-mmq-shape-trace.cpp b/tests/test-cuda-mmq-shape-trace.cpp
+index 1443749c3..2547193ce 100644
+--- a/tests/test-cuda-mmq-shape-trace.cpp
+++ b/tests/test-cuda-mmq-shape-trace.cpp
+@@ -27,7 +27,7 @@ int main() {
+     require(shape.n_active_est == 256, "active expert estimate is capped by expert count");
+     require(shape.density == 4, "density is ceil(assignments / active experts)");
+ 
+-    char buf[256];
++    char buf[512];
+     const int n = ggml_cuda_mmq_shape_format(buf, sizeof(buf), shape);
+ 
+     require(n > 0, "format returns byte count");
+@@ -302,35 +302,38 @@ int main() {
+         /* full_rows */ true,
+         /* is_cdna */ false,
+         /* is_rdna4 */ false,
+-        /* is_volta */ false);
++        /* is_volta */ false,
++        /* src0_name */ "blk.0.proj.weight",
++        /* src1_name */ "blk.0.proj.inp",
++        /* dst_name */ "blk.0.proj.out");
+ 
+     require(bf16_tc.route == GGML_CUDA_CUBLAS_ROUTE_BF16_TC,
+         "cuBLAS records native BF16 tensor-core route");
+ 
+     const ggml_cuda_cublas_route_shape nvfp4_bf16_tc = ggml_cuda_cublas_route_shape_make(
+         /* type */ 40, 0, 128, 128, 1024, 1024, 128, true, false, true, false, false, true, true,
+-        false, false, false);
++        false, false, false, "nvfp4.weight", "nvfp4.inp", "nvfp4.out");
+ 
+     require(nvfp4_bf16_tc.route == GGML_CUDA_CUBLAS_ROUTE_NVFP4_BF16_TC,
+         "cuBLAS records NVFP4 dequant-to-BF16 tensor-core route");
+ 
+     const ggml_cuda_cublas_route_shape f16_tc_16f = ggml_cuda_cublas_route_shape_make(
+         /* type */ 1, 0, 64, 64, 1024, 1024, 64, false, true, true, false, false, true, true,
+-        false, false, false);
++        false, false, false, "f16.weight", "f16.inp", "f16.out");
+ 
+     require(f16_tc_16f.route == GGML_CUDA_CUBLAS_ROUTE_F16_TC_16F,
+         "cuBLAS records default FP16 tensor-core 16F compute route");
+ 
+     const ggml_cuda_cublas_route_shape f16_tc_32f = ggml_cuda_cublas_route_shape_make(
+         /* type */ 1, 0, 64, 64, 1024, 1024, 64, false, true, true, true, false, true, true,
+-        false, false, false);
++        false, false, false, "f16.weight", "f16.inp", "f16.out");
+ 
+     require(f16_tc_32f.route == GGML_CUDA_CUBLAS_ROUTE_F16_TC_32F,
+         "cuBLAS records forced FP16 tensor-core 32F compute route");
+ 
+     const ggml_cuda_cublas_route_shape sgemm = ggml_cuda_cublas_route_shape_make(
+         /* type */ 0, 0, 12, 12, 1024, 1024, 12, false, false, true, false, false, true, true,
+-        false, false, false);
++        false, false, false, "f32.weight", "f32.inp", "f32.out");
+ 
+     require(sgemm.route == GGML_CUDA_CUBLAS_ROUTE_SGEMM,
+         "cuBLAS records SGEMM fallback route");
+@@ -346,6 +349,9 @@ int main() {
+     require(std::strstr(buf, "supports_bf16=1") != nullptr, "cuBLAS trace includes BF16 predicate");
+     require(std::strstr(buf, "force_fp32=0") != nullptr, "cuBLAS trace includes forced compute predicate");
+     require(std::strstr(buf, "src0_contig=1") != nullptr, "cuBLAS trace includes contiguity predicate");
++    require(std::strstr(buf, "src0=blk.0.proj.weight") != nullptr, "cuBLAS trace includes src0 name");
++    require(std::strstr(buf, "src1=blk.0.proj.inp") != nullptr, "cuBLAS trace includes src1 name");
++    require(std::strstr(buf, "dst=blk.0.proj.out") != nullptr, "cuBLAS trace includes dst name");
+ 
+     return 0;
+ }
+-- 
+2.43.0
+
--- a/docs/superpowers/plans/2026-07-01-cublas-name-trace-phase37.md
+++ b/docs/superpowers/plans/2026-07-01-cublas-name-trace-phase37.md
@@ -0,0 +1,69 @@
+# Phase 37: cuBLAS Tensor-Name Trace
+
+**Status:** DONE.
+
+**Scope:** additive follow-up to patch `0062`. Extend the default-off
+`LLAMA_CUBLAS_ROUTE_TRACE=<n>` diagnostic with `src0`, `src1`, and `dst` tensor
+names. No route or numeric behavior change.
+
+## Checklist
+
+- [x] Add RED/GREEN helper coverage for cuBLAS tensor-name trace fields.
+- [x] Wire tensor names from the generic cuBLAS path.
+- [x] Build CUDA targets on DGX.
+- [x] Run md5 gates with trace off and trace on.
+- [x] Run backend op gates with trace off and trace on.
+- [x] Capture n128 serving name trace.
+- [x] Run post-serving md5/op gates.
+- [x] Commit fork and DGX mirror, export LocalAI patch `0063`.
+
+## Result
+
+Artifact: `/home/mudler/bench/phase37_cublas_name_trace/20260701_083227`.
+
+- Local fork commit: `2d590d770 feat(cuda): trace cublas tensor names`
+- DGX mirror commit: `2cbb61969 feat(cuda): trace cublas tensor names`
+- Local/DGX tree after Phase 37: `dedb1182910eafe9f6875588dc8285bfb544cce5`
+- LocalAI patch: `backend/cpp/llama-cpp-localai-paged/patches/paged/0063-feat-cuda-trace-cublas-tensor-names.patch`
+
+## Gates
+
+| check | status | actual |
+|-------|--------|--------|
+| default-off MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
+| default-off dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
+| trace-enabled MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
+| trace-enabled dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
+| post-serving MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
+| post-serving dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
+| `MUL_MAT` | ok | `1146/1146` default, trace, post-serving |
+| `MUL_MAT_ID` | ok | `806/806` default, trace, post-serving |
+
+## Serving Trace
+
+`LLAMA_CUBLAS_ROUTE_TRACE=4096`, n128 MoE serving:
+
+| cuBLAS route | count |
+|--------------|------:|
+| `bf16_tc` | 2884 |
+| `sgemm` | 1212 |
+
+Top named entries were per-layer projections:
+
+- `bf16_tc type=30 src0=blk.N.attn_gate.weight src1=attn_norm-N dst=z-N`
+- `bf16_tc type=30 src0=blk.N.ssm_out.weight src1=final_output-N dst=linear_attn_out-N`
+- `sgemm type=0 src0=blk.N.ffn_gate_inp.weight src1=attn_post_norm-N dst=ffn_moe_logits-N`
+- `sgemm type=0 src0=blk.N.ffn_gate_inp_shexp.weight src1=attn_post_norm-N dst=shared_expert_gate-N`
+
+The traced serving run is diagnostic only; stderr tracing still depresses
+throughput and can create client-window disconnects. Post-serving md5/op gates
+remained green.
+
+## Decision
+
+- The Phase 36 F32 SGEMM bucket is not an opaque missed projection. It is mostly
+  MoE gating and shared-expert gate projection tensors whose weights are F32.
+- The next route-policy phase should not blindly force these to BF16. First
+  inspect model-load tensor types for `ffn_gate_inp*` and decide whether a
+  weight-conversion or graph-build route change is precision-safe. Any change
+  needs md5/op gates and, if tensor type conversion is involved, KL validation.
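The decision above calls for KL validation whenever a tensor type conversion is involved. The repository's KL harness is not shown in this commit, so the following is only a sketch of the underlying metric, assuming the comparison is made between next-token logits from a reference run and a candidate run at the same position. The function names are hypothetical, and both vectors are assumed to be non-empty and of equal length.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable log-softmax: subtract the max logit before exponentiating.
static std::vector<double> log_softmax(const std::vector<float> & logits) {
    const double max_logit = *std::max_element(logits.begin(), logits.end());
    double sum = 0.0;
    for (const float v : logits) {
        sum += std::exp(v - max_logit);
    }
    const double log_z = max_logit + std::log(sum);
    std::vector<double> out;
    out.reserve(logits.size());
    for (const float v : logits) {
        out.push_back(v - log_z);
    }
    return out;
}

// KL(P_ref || Q_test) in nats for one token position.
// Precondition (assumed, not checked): equal, non-zero vocabulary sizes.
static double kl_divergence(const std::vector<float> & ref_logits,
                            const std::vector<float> & test_logits) {
    const std::vector<double> lp = log_softmax(ref_logits);
    const std::vector<double> lq = log_softmax(test_logits);
    double kl = 0.0;
    for (std::size_t i = 0; i < lp.size(); ++i) {
        kl += std::exp(lp[i]) * (lp[i] - lq[i]);
    }
    return kl;
}
```

Unlike the md5 gate, which only reports whether the output is bit-identical, this metric is zero for identical distributions and grows with divergence, which is why a dtype change to `ffn_gate_inp*` would need it in addition to the md5/op gates. What threshold counts as acceptable is a policy decision that this sketch does not address.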