feat(paged): add cublas route trace patch

Add patch 0062 with default-off LLAMA_CUBLAS_ROUTE_TRACE instrumentation for generic cuBLAS MUL_MAT subroutes. Record Phase 36 DGX gates, serving trace results, and the next projection follow-up scope. Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 06:24:46 +00:00
parent 49cce0b5a2
commit fbdc200886
7 changed files with 508 additions and 10 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -2104,3 +2104,59 @@ Decision:
 - Next projection work should either add a cuBLAS/MMF subroute trace or test a
  bounded BF16 route policy for the `op_cublas` shapes. Do not chase batched
  cuBLAS for this measured serving slice.
+
+## Phase 36 cuBLAS Subroute Trace
+
+Phase 36 added patch `0062`, a default-off `LLAMA_CUBLAS_ROUTE_TRACE=<n>`
+diagnostic around the generic cuBLAS `MUL_MAT` path. It does not alter branch
+behavior; it classifies existing calls as `nvfp4_bf16_tc`, `bf16_tc`,
+`f16_tc_32f`, `f16_tc_16f`, or `sgemm`.
+
+Artifact:
+
+- `/home/mudler/bench/phase36_cublas_route_trace/20260701_081228`
+
+Run:
+
+- Fork commit: `/home/mudler/_git/llama.cpp` `38c4ef2e4`
+- DGX mirror commit: `dgx:~/llama-phase6-source` `e0224393a`
+- Env: `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 LLAMA_CUBLAS_ROUTE_TRACE=8192`
+- Workload: staggered n128 `llama-server` diagnostic trace
+
+Route summary:
+
+| route | count |
+|-------|------:|
+| `bf16_tc` | 5681 |
+| `sgemm` | 2511 |
+
+Top shapes:
+
+| route | shape | count |
+|-------|-------|------:|
+| `bf16_tc` | `type=30 row_diff=32 src1_ncols=510 ne00=2048 ne10=2048` | 360 |
+| `bf16_tc` | `type=30 row_diff=8192 src1_ncols=510 ne00=2048 ne10=2048` | 240 |
+| `bf16_tc` | `type=30 row_diff=2048 src1_ncols=510 ne00=4096 ne10=4096` | 240 |
+| `sgemm` | `type=0 row_diff=256 src1_ncols=510 ne00=2048 ne10=2048` | 240 |
+| `sgemm` | `type=0 row_diff=1 src1_ncols=510 ne00=2048 ne10=2048` | 240 |
+
+Gates:
+
+| check | status | actual |
+|-------|--------|--------|
+| default-off MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
+| default-off dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
+| trace-enabled MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
+| trace-enabled dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
+| post-serving MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
+| post-serving dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
+| `MUL_MAT` | ok | `1146/1146` default, trace, post-serving |
+| `MUL_MAT_ID` | ok | `806/806` default, trace, post-serving |
+
+Decision:
+
+- Phase 35's generic `op_cublas` bucket is BF16 tensor-core plus F32 SGEMM in
+  this serving slice. It is not NVFP4 cuBLAS and not batched cuBLAS.
+- The next projection phase should identify whether the `type=0` SGEMM shapes
+  are expected glue tensors or a missed BF16 route. Do not change routing until
+  a separately gated policy proves md5/op safety.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -466,6 +466,18 @@ MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
 was split `mat_f=2485`, `op_cublas=1330`. Next projection work should target
 BF16 `mat_f`/`op_cublas` subroute evidence or route policy, not batched cuBLAS.

+Phase 36 added patch `0062`, default-off `LLAMA_CUBLAS_ROUTE_TRACE=<n>`, to
+classify the generic cuBLAS `MUL_MAT` subroute without changing branch behavior.
+Artifact: `/home/mudler/bench/phase36_cublas_route_trace/20260701_081228`.
+Fork commit: `38c4ef2e4 feat(cuda): trace cublas routes`; DGX mirror commit:
+`e0224393a`. Default-off, trace-enabled, and post-serving gates stayed green:
+MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID`
+`806/806`. Live n128 serving with trace cap 8192 found `bf16_tc=5681` and
+`sgemm=2511`. The next projection phase should explain whether the F32 SGEMM
+shapes are expected glue tensors or a missed BF16 route; do not chase NVFP4
+cuBLAS or batched cuBLAS for this measured bucket.
+
 ---

 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -515,15 +527,15 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 ## 7. KEY FILE / ARTIFACT INDEX

 ### Fork (canonical source of truth)
- Local canonical fork: `/home/mudler/_git/llama.cpp`, branch **`localai-paged`**, HEAD `486c28c63d5297afd06e5a2bdbd4fb89cad749cd` ("trace mul mat routes", patch `0061`).
- DGX current clean mirror/build tree: `dgx:~/llama-phase6-source`, HEAD `18f7ad005` with the Phase 35 regular MUL_MAT route-trace patch applied and committed; Phase 20/26/27 artifacts still record their historical source hashes.
+- Local canonical fork: `/home/mudler/_git/llama.cpp`, branch **`localai-paged`**, HEAD `38c4ef2e4` ("trace cublas routes", patch `0062`).
+- DGX current clean mirror/build tree: `dgx:~/llama-phase6-source`, HEAD `e0224393a` with the Phase 36 cuBLAS route-trace patch applied and committed; Phase 20/26/27 artifacts still record their historical source hashes.
 - Historical DGX dev tree: `dgx:~/llama-paged-dev`, branch **`paged`**, HEAD `a7d439e8ce6990eb09721223c975da4e49d8d136` ("GDN CONFIG C (M8) - bf16 Kc/Qc"). It is an old experimental tree and must not be treated as canonical.

 ### LocalAI worktree
 - Path: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention`, branch `worktree-feat+paged-attention` (currently 246 ahead, 31 behind `origin/master`; recompute before reporting).
 - Backend dir: `backend/cpp/llama-cpp-localai-paged/` (`Makefile` thin wrapper, `package.sh`, `run.sh`, `README.md` ~44 KB canonical, `docs/`, `patches/paged/`).
 - `docs/`: `VLLM_PARITY_FINAL.md` (authoritative record), `VLLM_PARITY_LEVER_MAP.md` (working brainstorm, profile-validated section), `DECODE_SERVING_SCOPE.md`, `PREFILL_GEMM_SCOPE.md`, `PREFILL_GEMM_RESULTS.md`, `TENSORCORE_GDN_SCOPE.md`, `TENSORCORE_GDN_BUILD_PLAN.md`, `ACCELERATOR_PORTING_SCOPE.md`, `UPSTREAM_LAYER2_SCOPE.md`, `LOCALAI_LLAMACPP_BACKEND_PLAN.md`, `PAGED_BITEXACT_NOTE.md`, `PATCH_MAINTENANCE.md`, `final_benchmark.csv`, `paged-burst-bench.cpp`, `paged-reclaim-unit.cpp`, 3 PNGs, and this `PARITY_HANDOFF.md`.
- `patches/paged/`: **52** `.patch` files spanning 0001-0061 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata/shape/padding is 0048-0050; MoE safety tests are 0051-0053; MTP backend-sampling safety is 0054; speculative shape trace is 0055; MoE MMQ selector/launch/candidate/tile-policy/route instrumentation is 0056-0060; regular MUL_MAT route instrumentation is 0061.
+- `patches/paged/`: **53** `.patch` files spanning 0001-0062 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata/shape/padding is 0048-0050; MoE safety tests are 0051-0053; MTP backend-sampling safety is 0054; speculative shape trace is 0055; MoE MMQ selector/launch/candidate/tile-policy/route instrumentation is 0056-0060; regular MUL_MAT route instrumentation is 0061; cuBLAS route instrumentation is 0062.

 ### Bench artifacts (DGX)
 - `~/bench/COMBINED_DEFINITIVE.txt` (+ `.log`, `.done`, `combined_definitive.sh`, `combined_definitive.out`) - historical same-session both-engine run.
@@ -541,6 +553,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - `~/bench/phase33_small_m_tile_policy/20260701_071136` - default-off MoE MMQ small-M tile policy patch `0059`; tile16/tile8 md5/op safe but both slower in n128 serving.
 - `~/bench/phase34_mmid_route_trace/20260701_072737` - default-off MoE MMID route trace patch `0060`; default/trace/post-serving md5 gates green; n128 route trace found `mmq=2776`, `mmvq=1320`, `host_sync=0/4096`.
 - `~/bench/phase35_mul_mat_route_trace/20260701_074359` - default-off regular MUL_MAT route trace patch `0061`; default/trace/post-serving md5 gates green; n128 route trace found BF16 `mat_f=2485`, `op_cublas=1330`.
+- `~/bench/phase36_cublas_route_trace/20260701_081228` - default-off cuBLAS subroute trace patch `0062`; default/trace/post-serving md5 and op gates green; n128 route trace found `bf16_tc=5681`, `sgemm=2511`.
 - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
@@ -553,8 +566,8 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual

 ### Discrepancies to flag / resolve (carried verbatim from the gather, including UNVERIFIED labels)
 1. **Pin prose reconciled in this worktree.** Makefile line 52 `LLAMA_VERSION?=0ed235ea2c17a19fc8238668653946721ed136fd` is authoritative and matches the local fork merge-base. Hard rule: the paged pin must equal the stock `llama-cpp` pin (shared `grpc-server.cpp`); a bump to `c299a92c` once broke the grpc-server link despite being bit-exact and was reverted. Trust the Makefile when building.
-2. **Current fork/mirror are clean and verified.** Local fork HEAD is `486c28c63`, DGX clean mirror HEAD is `18f7ad005`, and Phase 35 should be treated as the current patch-series tip. The old `llama-paged-dev` tree is historical only.
-3. **Worktree patch series is tracked through 0061.** The only expected unrelated untracked path in this worktree is `.claude/`.
+2. **Current fork/mirror are clean and verified.** Local fork HEAD is `38c4ef2e4`, DGX clean mirror HEAD is `e0224393a`, and Phase 36 should be treated as the current patch-series tip. The old `llama-paged-dev` tree is historical only.
+3. **Worktree patch series is tracked through 0062.** The only expected unrelated untracked path in this worktree is `.claude/`.
 4. **`sm_121a` is not in the worktree build files** - it lives only in the DGX experimental build scripts (`gdn_cc.sh`, `gdn_bv_build.sh`, `paged-build.sh`); mainline uses arch `121`. **UNVERIFIED** whether the shipped CI Dockerfile build path injects `121a` for the FP4-MMA kernels (`Dockerfile.llama-cpp-localai-paged` does not hardcode a CUDA arch).
 5. **The `0921716...` paged-MoE md5 open item.** `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=0921716cd0582b5d15af8c362b811d00` for MoE, but a full doc/patch/`git log -S` grep of the worktree found **no** occurrence of `0921716...` in any committed source; the committed canonical paged-MoE gate is `8cb0ce23`. Treat this as **unreconciled**: the documented, KL-validated paged-MoE gate remains `8cb0ce23`, and any paged-MoE divergence (including `0921716`) must be KL-validated against the f16 reference before being accepted as benign, never on assertion alone. The `0921716` value is **UNVERIFIED** as a sanctioned gate; do not adopt it as canonical without re-running the KL gate. The **dense** run is symmetric: `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=ecfe924dee6c5622c149f419ff2a6481` for dense, which likewise differs from the canonical dense gate `5951a5b4`. Both CDEF `PAGED_GATE_MD5` values come from the `combined_definitive.sh` harness's own gate command, NOT the canonical bit-exact gate command in section 3.3, which is why they diverge from the committed `8cb0ce23` / `5951a5b4`; neither is a sanctioned gate and both must be KL-validated before being treated as benign.

--- a/backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md
@@ -57,18 +57,18 @@ everywhere without ever touching the stock `llama-cpp` source tree.

 ## Latest mirror check

-Phase 35 re-verified the mirror invariant after adding patch `0061`:
+Phase 36 re-verified the mirror invariant after adding patch `0062`:

 ```text
 base=0ed235ea2c17a19fc8238668653946721ed136fd
-applied_tree=305ebb96801822f2132ed9e9c868308b0759c7b9
-fork_tree=305ebb96801822f2132ed9e9c868308b0759c7b9
+applied_tree=208189d119efe27477f1900cc6f7428bd1720449
+fork_tree=208189d119efe27477f1900cc6f7428bd1720449
 ```

 The check used a fresh worktree at `LLAMA_VERSION`, applied every
 `patches/paged/0*.patch` with strict `git apply`, staged the result, and compared
 `git write-tree` to canonical fork branch `localai-paged` at
-`486c28c63 feat(cuda): trace mul mat routes`.
+`38c4ef2e4 feat(cuda): trace cublas routes`.

 ## Status

--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -931,6 +931,30 @@ in a future serving configuration, first isolate `mtp_eh_proj` / shared-head
 projection with `llama-debug --tensor-filter 'mtp_|h_nextn|nextn|ffn_|attn_'`
 before optimizing ordinary decoder projections.

+### Phase 36 cuBLAS subroute trace
+
+Phase 36 added patch `0062`, default-off `LLAMA_CUBLAS_ROUTE_TRACE=<n>`, to
+classify the generic cuBLAS `MUL_MAT` subroute without changing branch behavior.
+Artifact: `/home/mudler/bench/phase36_cublas_route_trace/20260701_081228`.
+
+Default-off, trace-enabled, and post-serving gates were all bit-exact: MoE
+`8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
+`806/806`.
+
+Live n128 serving with `LLAMA_CUBLAS_ROUTE_TRACE=8192` produced:
+
+| cuBLAS route | count |
+|--------------|------:|
+| `bf16_tc` | 5681 |
+| `sgemm` | 2511 |
+
+Top SGEMM shapes were `type=0 row_diff=256/1 src1_ncols=510 ne00=2048
+ne10=2048`. Lever implication: the measured `op_cublas` bucket is BF16
+tensor-core plus F32 SGEMM, not NVFP4 cuBLAS and not batched cuBLAS. The next
+projection phase should explain whether the F32 SGEMM shapes are expected glue
+tensors or a missed BF16 route, with md5/op gates before any route policy A/B.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

 ### Phase 10 GDN C32 slab update