From fbdc2008866bf3eaba1bbe48db7c1931a55e48d1 Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Wed, 1 Jul 2026 06:24:46 +0000 Subject: [PATCH] feat(paged): add cublas route trace patch Add patch 0062 with default-off LLAMA_CUBLAS_ROUTE_TRACE instrumentation for generic cuBLAS MUL_MAT subroutes. Record Phase 36 DGX gates, serving trace results, and the next projection follow-up scope. Assisted-by: Codex:gpt-5 --- backend/cpp/llama-cpp-localai-paged/README.md | 3 +- .../docs/GB10_PARITY_PHASE0_RESULTS.md | 56 +++ .../docs/PARITY_HANDOFF.md | 23 +- .../docs/PATCH_MAINTENANCE.md | 8 +- .../docs/VLLM_PARITY_LEVER_MAP.md | 24 ++ .../0062-feat-cuda-trace-cublas-routes.patch | 332 ++++++++++++++++++ .../2026-07-01-cublas-route-trace-phase36.md | 72 ++++ 7 files changed, 508 insertions(+), 10 deletions(-) create mode 100644 backend/cpp/llama-cpp-localai-paged/patches/paged/0062-feat-cuda-trace-cublas-routes.patch create mode 100644 docs/superpowers/plans/2026-07-01-cublas-route-trace-phase36.md diff --git a/backend/cpp/llama-cpp-localai-paged/README.md b/backend/cpp/llama-cpp-localai-paged/README.md index 808bb326c..e7192d543 100644 --- a/backend/cpp/llama-cpp-localai-paged/README.md +++ b/backend/cpp/llama-cpp-localai-paged/README.md @@ -87,7 +87,7 @@ orthogonal to the paged allocator. --- -## 3. Patch series (0001-0061) +## 3. Patch series (0001-0062) Source-only patches, with intentional numbering gaps (e.g. 0005, 0027). The decode-serving graph-reuse levers are 0040-0041. "Bit-exact" = greedy md5 / @@ -219,6 +219,7 @@ These are the dominant decode levers on the Qwen3.6 hybrid models. All bit-exact | 0059 | **Gate MoE small-M MMQ tile policy** - adds default-off `LLAMA_MOE_SMALL_M_TILE=` to cap only classified small-M MoE grouped-MMQ calls. This was used to A/B vLLM-like smaller M blocks without changing default inference. | yes (default-off, tile16, tile8, and post-serving gates green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`; Phase 33 rejected tile16 and tile8 as slower) | | 0060 | **Trace MoE MMID dispatch routes** - adds default-off `LLAMA_MOE_MMID_ROUTE_TRACE=` around `MUL_MAT_ID` dispatch, classifying each call as `mmvq`, `mmvf`, grouped `mmq`, `mmf`, or host-sync `fallback`. This is evidence-only instrumentation to resolve whether serving hits the per-expert host-sync fallback. | yes (default-off, trace-enabled, and post-serving gates green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`; Phase 34 n128 trace found `mmq=2776`, `mmvq=1320`, `host_sync=0/4096`) | | 0061 | **Trace regular MUL_MAT dispatch routes** - adds default-off `LLAMA_MUL_MAT_ROUTE_TRACE=` around regular `MUL_MAT`, classifying projection-heavy calls as `vec_f`, `mat_f`, `vec_q`, `mmq`, `batched_cublas`, `op_*`, `fp4_prefill`, or `fwht`. This is evidence-only instrumentation for the `bf16-proj` serving bucket. | yes (default-off, trace-enabled, and post-serving gates green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` `806/806`; Phase 35 n128 trace found BF16 routes `mat_f=2485`, `op_cublas=1330`) | +| 0062 | **Trace cuBLAS subroutes** - adds default-off `LLAMA_CUBLAS_ROUTE_TRACE=` around the generic cuBLAS `MUL_MAT` path, classifying calls as `nvfp4_bf16_tc`, `bf16_tc`, `f16_tc_32f`, `f16_tc_16f`, or `sgemm`. This is evidence-only instrumentation for the Phase 35 `op_cublas` bucket. | yes (default-off, trace-enabled, and post-serving gates green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` `806/806`; Phase 36 n128 trace found `bf16_tc=5681`, `sgemm=2511`) | > **Dropped: patch 0026 (hybrid per-head bf16 SSM state, `ssm_bf16_tau`).** Once > the decode fusions (0028 recurrent-state gather-fusion + 0029 block-table cache) diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md index e36fd74a2..0aed02038 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md @@ -2104,3 +2104,59 @@ Decision: - Next projection work should either add a cuBLAS/MMF subroute trace or test a bounded BF16 route policy for the `op_cublas` shapes. Do not chase batched cuBLAS for this measured serving slice. + +## Phase 36 cuBLAS Subroute Trace + +Phase 36 added patch `0062`, a default-off `LLAMA_CUBLAS_ROUTE_TRACE=` +diagnostic around the generic cuBLAS `MUL_MAT` path. It does not alter branch +behavior; it classifies existing calls as `nvfp4_bf16_tc`, `bf16_tc`, +`f16_tc_32f`, `f16_tc_16f`, or `sgemm`. + +Artifact: + +- `/home/mudler/bench/phase36_cublas_route_trace/20260701_081228` + +Run: + +- Fork commit: `/home/mudler/_git/llama.cpp` `38c4ef2e4` +- DGX mirror commit: `dgx:~/llama-phase6-source` `e0224393a` +- Env: `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 LLAMA_CUBLAS_ROUTE_TRACE=8192` +- Workload: staggered n128 `llama-server` diagnostic trace + +Route summary: + +| route | count | +|-------|------:| +| `bf16_tc` | 5681 | +| `sgemm` | 2511 | + +Top shapes: + +| route | shape | count | +|-------|-------|------:| +| `bf16_tc` | `type=30 row_diff=32 src1_ncols=510 ne00=2048 ne10=2048` | 360 | +| `bf16_tc` | `type=30 row_diff=8192 src1_ncols=510 ne00=2048 ne10=2048` | 240 | +| `bf16_tc` | `type=30 row_diff=2048 src1_ncols=510 ne00=4096 ne10=4096` | 240 | +| `sgemm` | `type=0 row_diff=256 src1_ncols=510 ne00=2048 ne10=2048` | 240 | +| `sgemm` | `type=0 row_diff=1 src1_ncols=510 ne00=2048 ne10=2048` | 240 | + +Gates: + +| check | status | actual | +|-------|--------|--------| +| default-off MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| default-off dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| trace-enabled MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| trace-enabled dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| post-serving MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| post-serving dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| `MUL_MAT` | ok | `1146/1146` default, trace, post-serving | +| `MUL_MAT_ID` | ok | `806/806` default, trace, post-serving | + +Decision: + +- Phase 35's generic `op_cublas` bucket is BF16 tensor-core plus F32 SGEMM in + this serving slice. It is not NVFP4 cuBLAS and not batched cuBLAS. +- The next projection phase should identify whether the `type=0` SGEMM shapes + are expected glue tensors or a missed BF16 route. Do not change routing until + a separately gated policy proves md5/op safety. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md index bbaf512c0..dd8e3c96b 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md @@ -466,6 +466,18 @@ MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense was split `mat_f=2485`, `op_cublas=1330`. Next projection work should target BF16 `mat_f`/`op_cublas` subroute evidence or route policy, not batched cuBLAS. +Phase 36 added patch `0062`, default-off `LLAMA_CUBLAS_ROUTE_TRACE=`, to +classify the generic cuBLAS `MUL_MAT` subroute without changing branch behavior. +Artifact: `/home/mudler/bench/phase36_cublas_route_trace/20260701_081228`. +Fork commit: `38c4ef2e4 feat(cuda): trace cublas routes`; DGX mirror commit: +`e0224393a`. Default-off, trace-enabled, and post-serving gates stayed green: +MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` +`806/806`. Live n128 serving with trace cap 8192 found `bf16_tc=5681` and +`sgemm=2511`. The next projection phase should explain whether the F32 SGEMM +shapes are expected glue tensors or a missed BF16 route; do not chase NVFP4 +cuBLAS or batched cuBLAS for this measured bucket. + --- ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes) @@ -515,15 +527,15 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual ## 7. KEY FILE / ARTIFACT INDEX ### Fork (canonical source of truth) -- Local canonical fork: `/home/mudler/_git/llama.cpp`, branch **`localai-paged`**, HEAD `486c28c63d5297afd06e5a2bdbd4fb89cad749cd` ("trace mul mat routes", patch `0061`). -- DGX current clean mirror/build tree: `dgx:~/llama-phase6-source`, HEAD `18f7ad005` with the Phase 35 regular MUL_MAT route-trace patch applied and committed; Phase 20/26/27 artifacts still record their historical source hashes. +- Local canonical fork: `/home/mudler/_git/llama.cpp`, branch **`localai-paged`**, HEAD `38c4ef2e4` ("trace cublas routes", patch `0062`). +- DGX current clean mirror/build tree: `dgx:~/llama-phase6-source`, HEAD `e0224393a` with the Phase 36 cuBLAS route-trace patch applied and committed; Phase 20/26/27 artifacts still record their historical source hashes. - Historical DGX dev tree: `dgx:~/llama-paged-dev`, branch **`paged`**, HEAD `a7d439e8ce6990eb09721223c975da4e49d8d136` ("GDN CONFIG C (M8) - bf16 Kc/Qc"). It is an old experimental tree and must not be treated as canonical. ### LocalAI worktree - Path: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention`, branch `worktree-feat+paged-attention` (currently 246 ahead, 31 behind `origin/master`; recompute before reporting). - Backend dir: `backend/cpp/llama-cpp-localai-paged/` (`Makefile` thin wrapper, `package.sh`, `run.sh`, `README.md` ~44 KB canonical, `docs/`, `patches/paged/`). - `docs/`: `VLLM_PARITY_FINAL.md` (authoritative record), `VLLM_PARITY_LEVER_MAP.md` (working brainstorm, profile-validated section), `DECODE_SERVING_SCOPE.md`, `PREFILL_GEMM_SCOPE.md`, `PREFILL_GEMM_RESULTS.md`, `TENSORCORE_GDN_SCOPE.md`, `TENSORCORE_GDN_BUILD_PLAN.md`, `ACCELERATOR_PORTING_SCOPE.md`, `UPSTREAM_LAYER2_SCOPE.md`, `LOCALAI_LLAMACPP_BACKEND_PLAN.md`, `PAGED_BITEXACT_NOTE.md`, `PATCH_MAINTENANCE.md`, `final_benchmark.csv`, `paged-burst-bench.cpp`, `paged-reclaim-unit.cpp`, 3 PNGs, and this `PARITY_HANDOFF.md`. -- `patches/paged/`: **52** `.patch` files spanning 0001-0061 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata/shape/padding is 0048-0050; MoE safety tests are 0051-0053; MTP backend-sampling safety is 0054; speculative shape trace is 0055; MoE MMQ selector/launch/candidate/tile-policy/route instrumentation is 0056-0060; regular MUL_MAT route instrumentation is 0061. +- `patches/paged/`: **53** `.patch` files spanning 0001-0062 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata/shape/padding is 0048-0050; MoE safety tests are 0051-0053; MTP backend-sampling safety is 0054; speculative shape trace is 0055; MoE MMQ selector/launch/candidate/tile-policy/route instrumentation is 0056-0060; regular MUL_MAT route instrumentation is 0061; cuBLAS route instrumentation is 0062. ### Bench artifacts (DGX) - `~/bench/COMBINED_DEFINITIVE.txt` (+ `.log`, `.done`, `combined_definitive.sh`, `combined_definitive.out`) - historical same-session both-engine run. @@ -541,6 +553,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual - `~/bench/phase33_small_m_tile_policy/20260701_071136` - default-off MoE MMQ small-M tile policy patch `0059`; tile16/tile8 md5/op safe but both slower in n128 serving. - `~/bench/phase34_mmid_route_trace/20260701_072737` - default-off MoE MMID route trace patch `0060`; default/trace/post-serving md5 gates green; n128 route trace found `mmq=2776`, `mmvq=1320`, `host_sync=0/4096`. - `~/bench/phase35_mul_mat_route_trace/20260701_074359` - default-off regular MUL_MAT route trace patch `0061`; default/trace/post-serving md5 gates green; n128 route trace found BF16 `mat_f=2485`, `op_cublas=1330`. +- `~/bench/phase36_cublas_route_trace/20260701_081228` - default-off cuBLAS subroute trace patch `0062`; default/trace/post-serving md5 and op gates green; n128 route trace found `bf16_tc=5681`, `sgemm=2511`. - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`. - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30. - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`. @@ -553,8 +566,8 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual ### Discrepancies to flag / resolve (carried verbatim from the gather, including UNVERIFIED labels) 1. **Pin prose reconciled in this worktree.** Makefile line 52 `LLAMA_VERSION?=0ed235ea2c17a19fc8238668653946721ed136fd` is authoritative and matches the local fork merge-base. Hard rule: the paged pin must equal the stock `llama-cpp` pin (shared `grpc-server.cpp`); a bump to `c299a92c` once broke the grpc-server link despite being bit-exact and was reverted. Trust the Makefile when building. -2. **Current fork/mirror are clean and verified.** Local fork HEAD is `486c28c63`, DGX clean mirror HEAD is `18f7ad005`, and Phase 35 should be treated as the current patch-series tip. The old `llama-paged-dev` tree is historical only. -3. **Worktree patch series is tracked through 0061.** The only expected unrelated untracked path in this worktree is `.claude/`. +2. **Current fork/mirror are clean and verified.** Local fork HEAD is `38c4ef2e4`, DGX clean mirror HEAD is `e0224393a`, and Phase 36 should be treated as the current patch-series tip. The old `llama-paged-dev` tree is historical only. +3. **Worktree patch series is tracked through 0062.** The only expected unrelated untracked path in this worktree is `.claude/`. 4. **`sm_121a` is not in the worktree build files** - it lives only in the DGX experimental build scripts (`gdn_cc.sh`, `gdn_bv_build.sh`, `paged-build.sh`); mainline uses arch `121`. **UNVERIFIED** whether the shipped CI Dockerfile build path injects `121a` for the FP4-MMA kernels (`Dockerfile.llama-cpp-localai-paged` does not hardcode a CUDA arch). 5. **The `0921716...` paged-MoE md5 open item.** `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=0921716cd0582b5d15af8c362b811d00` for MoE, but a full doc/patch/`git log -S` grep of the worktree found **no** occurrence of `0921716...` in any committed source; the committed canonical paged-MoE gate is `8cb0ce23`. Treat this as **unreconciled**: the documented, KL-validated paged-MoE gate remains `8cb0ce23`, and any paged-MoE divergence (including `0921716`) must be KL-validated against the f16 reference before being accepted as benign, never on assertion alone. The `0921716` value is **UNVERIFIED** as a sanctioned gate; do not adopt it as canonical without re-running the KL gate. The **dense** run is symmetric: `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=ecfe924dee6c5622c149f419ff2a6481` for dense, which likewise differs from the canonical dense gate `5951a5b4`. Both CDEF `PAGED_GATE_MD5` values come from the `combined_definitive.sh` harness's own gate command, NOT the canonical bit-exact gate command in section 3.3, which is why they diverge from the committed `8cb0ce23` / `5951a5b4`; neither is a sanctioned gate and both must be KL-validated before being treated as benign. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md b/backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md index 13457afa5..4c201da06 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md @@ -57,18 +57,18 @@ everywhere without ever touching the stock `llama-cpp` source tree. ## Latest mirror check -Phase 35 re-verified the mirror invariant after adding patch `0061`: +Phase 36 re-verified the mirror invariant after adding patch `0062`: ```text base=0ed235ea2c17a19fc8238668653946721ed136fd -applied_tree=305ebb96801822f2132ed9e9c868308b0759c7b9 -fork_tree=305ebb96801822f2132ed9e9c868308b0759c7b9 +applied_tree=208189d119efe27477f1900cc6f7428bd1720449 +fork_tree=208189d119efe27477f1900cc6f7428bd1720449 ``` The check used a fresh worktree at `LLAMA_VERSION`, applied every `patches/paged/0*.patch` with strict `git apply`, staged the result, and compared `git write-tree` to canonical fork branch `localai-paged` at -`486c28c63 feat(cuda): trace mul mat routes`. +`38c4ef2e4 feat(cuda): trace cublas routes`. ## Status diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md index f3ccbe57e..2afd1a052 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md @@ -931,6 +931,30 @@ in a future serving configuration, first isolate `mtp_eh_proj` / shared-head projection with `llama-debug --tensor-filter 'mtp_|h_nextn|nextn|ffn_|attn_'` before optimizing ordinary decoder projections. +### Phase 36 cuBLAS subroute trace + +Phase 36 added patch `0062`, default-off `LLAMA_CUBLAS_ROUTE_TRACE=`, to +classify the generic cuBLAS `MUL_MAT` subroute without changing branch behavior. +Artifact: `/home/mudler/bench/phase36_cublas_route_trace/20260701_081228`. + +Default-off, trace-enabled, and post-serving gates were all bit-exact: MoE +`8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID` +`806/806`. + +Live n128 serving with `LLAMA_CUBLAS_ROUTE_TRACE=8192` produced: + +| cuBLAS route | count | +|--------------|------:| +| `bf16_tc` | 5681 | +| `sgemm` | 2511 | + +Top SGEMM shapes were `type=0 row_diff=256/1 src1_ncols=510 ne00=2048 +ne10=2048`. Lever implication: the measured `op_cublas` bucket is BF16 +tensor-core plus F32 SGEMM, not NVFP4 cuBLAS and not batched cuBLAS. The next +projection phase should explain whether the F32 SGEMM shapes are expected glue +tensors or a missed BF16 route, with md5/op gates before any route policy A/B. + Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`. ### Phase 10 GDN C32 slab update diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0062-feat-cuda-trace-cublas-routes.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0062-feat-cuda-trace-cublas-routes.patch new file mode 100644 index 000000000..4bac09e4e --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0062-feat-cuda-trace-cublas-routes.patch @@ -0,0 +1,332 @@ +From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Wed, 1 Jul 2026 06:20:31 +0000 +Subject: [PATCH] feat(cuda): trace cublas routes + +Add a default-off LLAMA_CUBLAS_ROUTE_TRACE diagnostic around the generic cuBLAS MUL_MAT path. + +The trace classifies NVFP4/BF16/FP16/SGEMM subroutes without changing branch behavior, and extends the route helper test coverage. + +Assisted-by: Codex:gpt-5 +--- + ggml/src/ggml-cuda/ggml-cuda.cu | 53 +++++++++++-- + ggml/src/ggml-cuda/mmq-shape-trace.h | 108 +++++++++++++++++++++++++++ + tests/test-cuda-mmq-shape-trace.cpp | 62 +++++++++++++++ + 3 files changed, 216 insertions(+), 7 deletions(-) + +diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu +index cd34aff13..eff197818 100644 +--- a/ggml/src/ggml-cuda/ggml-cuda.cu ++++ b/ggml/src/ggml-cuda/ggml-cuda.cu +@@ -1627,6 +1627,32 @@ static const cublas_force_compute_type & ggml_cuda_cublas_get_force_compute_type + return compute_type; + } + ++static inline int ggml_cuda_cublas_route_trace_limit() { ++ static const int value = []() { ++ const char * s = getenv("LLAMA_CUBLAS_ROUTE_TRACE"); ++ return s ? atoi(s) : 0; ++ }(); ++ ++ return value; ++} ++ ++static inline void ggml_cuda_cublas_route_trace(const ggml_cuda_cublas_route_shape & shape) { ++ const int trace_limit = ggml_cuda_cublas_route_trace_limit(); ++ if (trace_limit <= 0) { ++ return; ++ } ++ ++ static std::atomic trace_count{0}; ++ const int trace_idx = trace_count.fetch_add(1, std::memory_order_relaxed); ++ if (trace_idx >= trace_limit) { ++ return; ++ } ++ ++ char buf[320]; ++ ggml_cuda_cublas_route_shape_format(buf, sizeof(buf), shape); ++ fprintf(stderr, "[LLAMA_CUBLAS_ROUTE] %s\n", buf); ++} ++ + static void ggml_cuda_op_mul_mat_cublas( + ggml_backend_cuda_context & ctx, + const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst, const char * src0_dd_i, const float * src1_ddf_i, +@@ -1662,7 +1688,22 @@ static void ggml_cuda_op_mul_mat_cublas( + row_diff == src0->ne[1] && + dst->op_params[0] == GGML_PREC_DEFAULT; + +- if (supports_bf16 && src0->type == GGML_TYPE_NVFP4 && ggml_is_contiguous(src0) && row_diff == src0->ne[1]) { ++ const bool src0_contig = ggml_is_contiguous(src0); ++ const bool full_rows = row_diff == src0->ne[1]; ++ const bool fast_fp16 = fast_fp16_hardware_available(cc); ++ bool force_fp32 = false; ++ bool force_fp16 = false; ++ if (fast_fp16 && use_fp16) { ++ const auto & force_compute_type = ggml_cuda_cublas_get_force_compute_type(); ++ force_fp32 = force_compute_type.fp32; ++ force_fp16 = force_compute_type.fp16; ++ } ++ ggml_cuda_cublas_route_trace(ggml_cuda_cublas_route_shape_make( ++ src0->type, src1->type, row_diff, src1_ncols, ne00, ne10, ldc, ++ supports_bf16, use_fp16, fast_fp16, force_fp32, force_fp16, src0_contig, full_rows, ++ GGML_CUDA_CC_IS_CDNA(cc), GGML_CUDA_CC_IS_RDNA4(cc), cc == GGML_CUDA_CC_VOLTA)); ++ ++ if (supports_bf16 && src0->type == GGML_TYPE_NVFP4 && src0_contig && full_rows) { + // Paged prefill lever (patch 0033): NVFP4 only reaches cuBLAS when + // ggml_cuda_should_use_mmq() returned false (large-M dense prefill). + // Dequant the FP4 weights to a TRANSIENT bf16 pool buffer and run a +@@ -1702,7 +1743,7 @@ static void ggml_cuda_op_mul_mat_cublas( + + const to_fp32_cuda_t to_fp32_cuda = ggml_get_to_fp32_cuda(GGML_TYPE_BF16); + to_fp32_cuda(dst_bf16.get(), dst_dd_i, row_diff*src1_ncols, stream); +- } else if (supports_bf16 && src0->type == GGML_TYPE_BF16 && ggml_is_contiguous(src0) && row_diff == src0->ne[1]) { ++ } else if (supports_bf16 && src0->type == GGML_TYPE_BF16 && src0_contig && full_rows) { + ggml_cuda_pool_alloc src1_as_bf16(ctx.pool(id)); + if (src1->type != GGML_TYPE_BF16) { + const to_bf16_cuda_t to_bf16_cuda = ggml_get_to_bf16_cuda(src1->type); +@@ -1730,7 +1771,7 @@ static void ggml_cuda_op_mul_mat_cublas( + + const to_fp32_cuda_t to_fp32_cuda = ggml_get_to_fp32_cuda(GGML_TYPE_BF16); + to_fp32_cuda(dst_bf16.get(), dst_dd_i, row_diff*src1_ncols, stream); +- } else if (fast_fp16_hardware_available(cc) && use_fp16) { ++ } else if (fast_fp16 && use_fp16) { + // convert src0 and src1 to fp16, multiply as fp16, convert dst to fp32 + ggml_cuda_pool_alloc src0_as_f16(ctx.pool(id)); + if (src0->type != GGML_TYPE_F16) { +@@ -1754,12 +1795,10 @@ static void ggml_cuda_op_mul_mat_cublas( + + CUBLAS_CHECK(cublasSetStream(ctx.cublas_handle(id), stream)); + +- const auto & force_compute_type = ggml_cuda_cublas_get_force_compute_type(); +- +- if (!force_compute_type.fp16 && (GGML_CUDA_CC_IS_CDNA(cc) ++ if (!force_fp16 && (GGML_CUDA_CC_IS_CDNA(cc) + || GGML_CUDA_CC_IS_RDNA4(cc) + || cc == GGML_CUDA_CC_VOLTA +- || force_compute_type.fp32)) ++ || force_fp32)) + { + const float alpha = 1.0f; + const float beta = 0.0f; +diff --git a/ggml/src/ggml-cuda/mmq-shape-trace.h b/ggml/src/ggml-cuda/mmq-shape-trace.h +index 8ac373fd9..f5b4ecf2c 100644 +--- a/ggml/src/ggml-cuda/mmq-shape-trace.h ++++ b/ggml/src/ggml-cuda/mmq-shape-trace.h +@@ -85,6 +85,14 @@ enum ggml_cuda_mul_mat_route { + GGML_CUDA_MUL_MAT_ROUTE_OP_CUBLAS, + }; + ++enum ggml_cuda_cublas_route { ++ GGML_CUDA_CUBLAS_ROUTE_NVFP4_BF16_TC, ++ GGML_CUDA_CUBLAS_ROUTE_BF16_TC, ++ GGML_CUDA_CUBLAS_ROUTE_F16_TC_32F, ++ GGML_CUDA_CUBLAS_ROUTE_F16_TC_16F, ++ GGML_CUDA_CUBLAS_ROUTE_SGEMM, ++}; ++ + struct ggml_cuda_mul_mat_route_shape { + ggml_cuda_mul_mat_route route; + int type; +@@ -102,6 +110,27 @@ struct ggml_cuda_mul_mat_route_shape { + bool use_fwht; + }; + ++struct ggml_cuda_cublas_route_shape { ++ ggml_cuda_cublas_route route; ++ int type; ++ int src1_type; ++ int64_t row_diff; ++ int64_t src1_ncols; ++ int64_t ne00; ++ int64_t ne10; ++ int64_t ldc; ++ bool supports_bf16; ++ bool use_fp16; ++ bool fast_fp16; ++ bool force_fp32; ++ bool force_fp16; ++ bool src0_contig; ++ bool full_rows; ++ bool is_cdna; ++ bool is_rdna4; ++ bool is_volta; ++}; ++ + static inline const char * ggml_cuda_mmid_route_name(const ggml_cuda_mmid_route route) { + switch (route) { + case GGML_CUDA_MMID_ROUTE_MMVQ: return "mmvq"; +@@ -132,6 +161,18 @@ static inline const char * ggml_cuda_mul_mat_route_name(const ggml_cuda_mul_mat_ + return "unknown"; + } + ++static inline const char * ggml_cuda_cublas_route_name(const ggml_cuda_cublas_route route) { ++ switch (route) { ++ case GGML_CUDA_CUBLAS_ROUTE_NVFP4_BF16_TC: return "nvfp4_bf16_tc"; ++ case GGML_CUDA_CUBLAS_ROUTE_BF16_TC: return "bf16_tc"; ++ case GGML_CUDA_CUBLAS_ROUTE_F16_TC_32F: return "f16_tc_32f"; ++ case GGML_CUDA_CUBLAS_ROUTE_F16_TC_16F: return "f16_tc_16f"; ++ case GGML_CUDA_CUBLAS_ROUTE_SGEMM: return "sgemm"; ++ } ++ ++ return "unknown"; ++} ++ + static inline ggml_cuda_mmq_shape ggml_cuda_mmq_shape_make( + const int type, const bool is_moe, const int64_t ncols_dst, const int64_t nchannels_x, + const int64_t ncols_max, const int mmq_x_max, const int mmq_x_lim, const int mmq_x_best, +@@ -205,6 +246,47 @@ static inline ggml_cuda_mul_mat_route_shape ggml_cuda_mul_mat_route_shape_make( + }; + } + ++static inline ggml_cuda_cublas_route_shape ggml_cuda_cublas_route_shape_make( ++ const int type, const int src1_type, const int64_t row_diff, const int64_t src1_ncols, ++ const int64_t ne00, const int64_t ne10, const int64_t ldc, const bool supports_bf16, ++ const bool use_fp16, const bool fast_fp16, const bool force_fp32, const bool force_fp16, ++ const bool src0_contig, const bool full_rows, const bool is_cdna, const bool is_rdna4, ++ const bool is_volta) { ++ ggml_cuda_cublas_route route = GGML_CUDA_CUBLAS_ROUTE_SGEMM; ++ if (supports_bf16 && type == 40 && src0_contig && full_rows) { ++ route = GGML_CUDA_CUBLAS_ROUTE_NVFP4_BF16_TC; ++ } else if (supports_bf16 && type == 30 && src0_contig && full_rows) { ++ route = GGML_CUDA_CUBLAS_ROUTE_BF16_TC; ++ } else if (fast_fp16 && use_fp16) { ++ if (!force_fp16 && (is_cdna || is_rdna4 || is_volta || force_fp32)) { ++ route = GGML_CUDA_CUBLAS_ROUTE_F16_TC_32F; ++ } else { ++ route = GGML_CUDA_CUBLAS_ROUTE_F16_TC_16F; ++ } ++ } ++ ++ return { ++ route, ++ type, ++ src1_type, ++ row_diff, ++ src1_ncols, ++ ne00, ++ ne10, ++ ldc, ++ supports_bf16, ++ use_fp16, ++ fast_fp16, ++ force_fp32, ++ force_fp16, ++ src0_contig, ++ full_rows, ++ is_cdna, ++ is_rdna4, ++ is_volta, ++ }; ++} ++ + static inline ggml_cuda_mmid_route_shape ggml_cuda_mmid_route_shape_make( + const int type, const int64_t ne2, const int64_t ne12, const int64_t n_experts, + const int mmvq_max, const bool use_mmq, const bool use_mmf, const bool is_amd, +@@ -377,6 +459,32 @@ static inline int ggml_cuda_mul_mat_route_shape_format( + shape.use_fwht ? 1 : 0); + } + ++static inline int ggml_cuda_cublas_route_shape_format( ++ char * buf, const size_t size, const ggml_cuda_cublas_route_shape & shape) { ++ return std::snprintf(buf, size, ++ "route=%s type=%d src1_type=%d row_diff=%lld src1_ncols=%lld ne00=%lld ne10=%lld ldc=%lld " ++ "supports_bf16=%d use_fp16=%d fast_fp16=%d force_fp32=%d force_fp16=%d " ++ "src0_contig=%d full_rows=%d is_cdna=%d is_rdna4=%d is_volta=%d", ++ ggml_cuda_cublas_route_name(shape.route), ++ shape.type, ++ shape.src1_type, ++ (long long) shape.row_diff, ++ (long long) shape.src1_ncols, ++ (long long) shape.ne00, ++ (long long) shape.ne10, ++ (long long) shape.ldc, ++ shape.supports_bf16 ? 1 : 0, ++ shape.use_fp16 ? 1 : 0, ++ shape.fast_fp16 ? 1 : 0, ++ shape.force_fp32 ? 1 : 0, ++ shape.force_fp16 ? 1 : 0, ++ shape.src0_contig ? 1 : 0, ++ shape.full_rows ? 1 : 0, ++ shape.is_cdna ? 1 : 0, ++ shape.is_rdna4 ? 1 : 0, ++ shape.is_volta ? 1 : 0); ++} ++ + static inline int ggml_cuda_mmq_small_m_shape_format( + char * buf, const size_t size, const ggml_cuda_mmq_small_m_shape & shape) { + return std::snprintf(buf, size, +diff --git a/tests/test-cuda-mmq-shape-trace.cpp b/tests/test-cuda-mmq-shape-trace.cpp +index 2bd41d1d8..1443749c3 100644 +--- a/tests/test-cuda-mmq-shape-trace.cpp ++++ b/tests/test-cuda-mmq-shape-trace.cpp +@@ -285,5 +285,67 @@ int main() { + require(std::strstr(buf, "use_batched_cublas=0") != nullptr, + "regular MUL_MAT trace includes batched cuBLAS predicate"); + ++ const ggml_cuda_cublas_route_shape bf16_tc = ggml_cuda_cublas_route_shape_make( ++ /* type */ 30, ++ /* src1_type */ 0, ++ /* row_diff */ 18, ++ /* src1_ncols */ 18, ++ /* ne00 */ 1024, ++ /* ne10 */ 1024, ++ /* ldc */ 18, ++ /* supports_bf16 */ true, ++ /* use_fp16 */ false, ++ /* fast_fp16 */ true, ++ /* force_fp32 */ false, ++ /* force_fp16 */ false, ++ /* src0_contig */ true, ++ /* full_rows */ true, ++ /* is_cdna */ false, ++ /* is_rdna4 */ false, ++ /* is_volta */ false); ++ ++ require(bf16_tc.route == GGML_CUDA_CUBLAS_ROUTE_BF16_TC, ++ "cuBLAS records native BF16 tensor-core route"); ++ ++ const ggml_cuda_cublas_route_shape nvfp4_bf16_tc = ggml_cuda_cublas_route_shape_make( ++ /* type */ 40, 0, 128, 128, 1024, 1024, 128, true, false, true, false, false, true, true, ++ false, false, false); ++ ++ require(nvfp4_bf16_tc.route == GGML_CUDA_CUBLAS_ROUTE_NVFP4_BF16_TC, ++ "cuBLAS records NVFP4 dequant-to-BF16 tensor-core route"); ++ ++ const ggml_cuda_cublas_route_shape f16_tc_16f = ggml_cuda_cublas_route_shape_make( ++ /* type */ 1, 0, 64, 64, 1024, 1024, 64, false, true, true, false, false, true, true, ++ false, false, false); ++ ++ require(f16_tc_16f.route == GGML_CUDA_CUBLAS_ROUTE_F16_TC_16F, ++ "cuBLAS records default FP16 tensor-core 16F compute route"); ++ ++ const ggml_cuda_cublas_route_shape f16_tc_32f = ggml_cuda_cublas_route_shape_make( ++ /* type */ 1, 0, 64, 64, 1024, 1024, 64, false, true, true, true, false, true, true, ++ false, false, false); ++ ++ require(f16_tc_32f.route == GGML_CUDA_CUBLAS_ROUTE_F16_TC_32F, ++ "cuBLAS records forced FP16 tensor-core 32F compute route"); ++ ++ const ggml_cuda_cublas_route_shape sgemm = ggml_cuda_cublas_route_shape_make( ++ /* type */ 0, 0, 12, 12, 1024, 1024, 12, false, false, true, false, false, true, true, ++ false, false, false); ++ ++ require(sgemm.route == GGML_CUDA_CUBLAS_ROUTE_SGEMM, ++ "cuBLAS records SGEMM fallback route"); ++ ++ const int cublas_route_n = ggml_cuda_cublas_route_shape_format(buf, sizeof(buf), bf16_tc); ++ ++ require(cublas_route_n > 0, "cuBLAS route format returns byte count"); ++ require(std::strstr(buf, "route=bf16_tc") != nullptr, "cuBLAS trace includes route name"); ++ require(std::strstr(buf, "type=30") != nullptr, "cuBLAS trace includes src0 type"); ++ require(std::strstr(buf, "src1_type=0") != nullptr, "cuBLAS trace includes src1 type"); ++ require(std::strstr(buf, "row_diff=18") != nullptr, "cuBLAS trace includes row count"); ++ require(std::strstr(buf, "src1_ncols=18") != nullptr, "cuBLAS trace includes source column count"); ++ require(std::strstr(buf, "supports_bf16=1") != nullptr, "cuBLAS trace includes BF16 predicate"); ++ require(std::strstr(buf, "force_fp32=0") != nullptr, "cuBLAS trace includes forced compute predicate"); ++ require(std::strstr(buf, "src0_contig=1") != nullptr, "cuBLAS trace includes contiguity predicate"); ++ + return 0; + } +-- +2.43.0 + diff --git a/docs/superpowers/plans/2026-07-01-cublas-route-trace-phase36.md b/docs/superpowers/plans/2026-07-01-cublas-route-trace-phase36.md new file mode 100644 index 000000000..6c651af7d --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-cublas-route-trace-phase36.md @@ -0,0 +1,72 @@ +# Phase 36: cuBLAS Route Trace + +**Status:** DONE. + +**Scope:** llama.cpp fork first, then LocalAI patch `0062`. Instrumentation only; +no route, branch, or numeric behavior change. + +## Checklist + +- [x] Add RED/GREEN helper tests for cuBLAS subroute classification. +- [x] Add default-off `LLAMA_CUBLAS_ROUTE_TRACE=` around generic cuBLAS + `MUL_MAT` dispatch. +- [x] Build CUDA targets on DGX. +- [x] Run md5 gates with trace off and trace on. +- [x] Run backend op gates with trace off and trace on. +- [x] Capture n128 serving route distribution. +- [x] Run post-serving md5/op gates. +- [x] Commit fork and DGX mirror, export LocalAI patch `0062`. + +## Result + +Artifact: `/home/mudler/bench/phase36_cublas_route_trace/20260701_081228`. + +- Local fork commit: `38c4ef2e4 feat(cuda): trace cublas routes` +- DGX mirror commit: `e0224393a feat(cuda): trace cublas routes` +- Local/DGX tree after Phase 36: `208189d119efe27477f1900cc6f7428bd1720449` +- LocalAI patch: `backend/cpp/llama-cpp-localai-paged/patches/paged/0062-feat-cuda-trace-cublas-routes.patch` + +## Gates + +| check | status | actual | +|-------|--------|--------| +| default-off MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| default-off dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| trace-enabled MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| trace-enabled dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| post-serving MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| post-serving dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| `MUL_MAT` | ok | `1146/1146` default, trace, post-serving | +| `MUL_MAT_ID` | ok | `806/806` default, trace, post-serving | + +## Serving Trace + +`LLAMA_CUBLAS_ROUTE_TRACE=8192`, n128 MoE serving: + +| cuBLAS route | count | +|--------------|------:| +| `bf16_tc` | 5681 | +| `sgemm` | 2511 | + +Top shapes: + +| route | shape | count | +|-------|-------|------:| +| `bf16_tc` | `type=30 row_diff=32 src1_ncols=510 ne00=2048 ne10=2048` | 360 | +| `bf16_tc` | `type=30 row_diff=8192 src1_ncols=510 ne00=2048 ne10=2048` | 240 | +| `bf16_tc` | `type=30 row_diff=2048 src1_ncols=510 ne00=4096 ne10=4096` | 240 | +| `sgemm` | `type=0 row_diff=256 src1_ncols=510 ne00=2048 ne10=2048` | 240 | +| `sgemm` | `type=0 row_diff=1 src1_ncols=510 ne00=2048 ne10=2048` | 240 | + +The traced serving run is diagnostic only: heavy stderr tracing depressed +throughput and the client window reported disconnects at shutdown. The +post-serving md5/op gates above stayed green. + +## Decision + +- Generic cuBLAS serving calls are BF16 tensor-core and F32 SGEMM; the measured + route does not show NVFP4 cuBLAS or batched cuBLAS as the next bucket. +- The next projection phase should investigate why the F32 SGEMM shapes remain + `type=0` and whether they are expected glue/projection tensors or a missed + BF16 route. Any route-policy change must be separately gated by the same md5 + and `test-backend-ops` checks before benchmarking.