From 3c2cb9f4ab9cd7e754a0e8a9f8d254dfaea75417 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Wed, 1 Jul 2026 04:00:14 +0000
Subject: [PATCH] docs(paged): record graph-node serving profile

Record the Phase 27 current-stack llama.cpp n128 serving profile
captured with CUDA graph node tracing and gated before and after the run.

Assisted-by: Codex:gpt-5
---
 .../docs/GB10_PARITY_PHASE0_RESULTS.md        | 78 ++++++++++++++++
 .../docs/PARITY_HANDOFF.md                    | 13 +++
 .../docs/VLLM_PARITY_LEVER_MAP.md             | 30 +++++++
 ...7-01-graph-node-serving-profile-phase27.md | 89 +++++++++++++++++++
 4 files changed, 210 insertions(+)
 create mode 100644 docs/superpowers/plans/2026-07-01-graph-node-serving-profile-phase27.md

diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index 04909be9d..f4531d7d8 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -1637,3 +1637,81 @@ Decision:
 - Current paged/vLLM decode ratios remain about `81.5%` at n8, `69.0%` at n32,
   and `65.7%` at n128; e2e aggregate ratios remain about `70.6%`, `54.6%`, and
   `49.4%`.
+
+## Phase 27 Graph-Node-Traced Current-Stack Serving Profile
+
+Phase 27 re-profiled the current clean llama.cpp serving path with CUDA graph
+node tracing enabled. This checks the Phase 8 bucket picture against the decode
+profiling rule: serving/decode profiles must use `--cuda-graph-trace=node`.
+
+Artifact:
+
+- `/home/mudler/bench/phase27_graph_node_serving/20260701_055519`
+
+Source and hardware:
+
+- `/home/mudler/llama-phase6-source`
+- `f2521ab12 feat(server): trace speculative batch shapes`
+- `GPU 0: NVIDIA GB10`, driver `580.159.03`, compute capability `12.1`
+- Nsight Systems `2025.3.2.474-253236389321v0`
+
+Safety gates:
+
+| phase | check | status | actual |
+|-------|-------|--------|--------|
+| pre | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
+| pre | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
+| pre | `MUL_MAT_ID` | ok | `806/806` |
+| post retry | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
+| post retry | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
+| post retry | `MUL_MAT_ID` | ok | `806/806` |
+
+The first immediate post-gate attempt raced with Nsight teardown and rejected
+the run because it detected one compute process even though `nvidia-smi` already
+printed no running processes. The post-gate retry started from `docker=0`,
+`local_ai_worker=0`, `compute=0`, and a `FREE` owner file.
+
+Serving sample (`n=128`, `PTOK=128`, `GEN=64`):
+
+| agg_tps | decode_agg_tps | decode_perseq_tps | prefill_tps | TTFT mean ms |
+|---------|----------------|--------------------|-------------|--------------|
+| 319.9 | 675.5 | 3.9 | 1671.1 | 8363.4 |
+
+This matches Phase 26's n128 paged decode rate (`673.4` decode_agg_tps) closely
+enough to treat the profile as representative for bucket direction.
+
+Graph-node-traced kernel buckets:
+
+| macro bucket | time ms | share |
+|--------------|---------|-------|
+| GDN | 6706.33 | 33.47% |
+| MoE/FFN-GEMM | 5871.92 | 29.31% |
+| bf16-proj | 2725.07 | 13.60% |
+| layout-copy | 1309.99 | 6.54% |
+| ew-mul(weight/norm/GDN) | 724.29 | 3.61% |
+| act-quant | 697.75 | 3.48% |
+| norms/residual | 405.29 | 2.02% |
+| ew-add(resid/MoE-fanin) | 361.81 | 1.81% |
+| MoE-dispatch | 275.99 | 1.38% |
+| FA | 271.03 | 1.35% |
+
+Fine buckets:
+
+- `gdn_core`: `5929.85 ms` (`29.59%`)
+- `mmq_nvfp4`: `5697.79 ms` (`28.44%`)
+- `cublas_bf16_gemm`: `1892.81 ms` (`9.45%`)
+- `act_quant`: `697.75 ms` (`3.48%`)
+- `mm_ids`: `121.99 ms` (`0.61%`)
+- `gather_mmq`: `73.88 ms` (`0.37%`)
+- `argsort_topk`: `80.11 ms` (`0.40%`)
+
+Decision:
+
+- The graph-node-traced current-stack profile confirms the Phase 8 source
+  shortcut decision. Metadata/helper work is still too small: `mm_ids`,
+  `gather_mmq`, and `argsort_topk` together are about `1.38%`.
+- A credible GB10 source patch would have to reduce `gdn_core` or
+  `mmq_nvfp4`/bf16 projection work directly. The low-conflict helper-dispatch
+  path still should not be reopened.
+- The serving profile does not change the Phase 26 parity verdict: n128 paged
+  decode remains about `675 tok/s`, far below vLLM's same-session `1025 tok/s`.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index 8b0b5a26e..87bec651b 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -364,6 +364,18 @@ Use Phase 26 as the current audit-grade GB10 snapshot. It keeps the Phase 20
 verdict intact, but the artifact is more useful for future regressions because
 it carries hardware classification and compact pre/post inference gates.
+
+Phase 27 re-profiled the current clean llama.cpp n128 serving path with
+`nsys --cuda-graph-trace=node`. Artifact:
+`/home/mudler/bench/phase27_graph_node_serving/20260701_055519`. The run matched
+Phase 26 throughput closely (`675.5` vs `673.4` decode_agg_tps) and kept gates
+green before and after the profile (post retry): MoE md5
+`8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID` `806/806`. The node-traced
+buckets still put the work in `gdn_core` (`29.59%`) and `mmq_nvfp4` (`28.44%`);
+helper dispatch remains too small (`mm_ids` `0.61%`, `gather_mmq` `0.37%`,
+`argsort_topk` `0.40%`). Do not reopen metadata/helper-only MoE dispatch work on
+GB10.
+
 ---

 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -430,6 +442,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - `~/bench/phase24_hardware_report_dryrun/20260701_052741` - current snapshot harness dry run proving `hardware.txt` captures the DGX as `hardware_class=gb10_or_workstation_blackwell`.
 - `~/bench/phase25_gate_summary_dryrun/20260701_053353` - dry run after adding `gate_summary.tsv` support; normal dry-run still writes `hardware.txt` and does not emit a gate summary before gates exist.
 - `~/bench/phase26_audited_snapshot/20260701_053650` - current audit-grade full paged-vs-vLLM MoE serving snapshot with `hardware.txt`, pre/post gates, `summary.tsv`, and `gate_summary.tsv`.
+- `~/bench/phase27_graph_node_serving/20260701_055519` - current clean llama.cpp n128 serving profile captured with `--cuda-graph-trace=node`, pre/post retry gates green.
 - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
index 25ec6b3b8..f2eb4abca 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -716,6 +716,36 @@ parity on GB10. Treat Phase 26 as the current benchmark baseline before funding
 new kernel work, and keep md5/op gates as the first check when changing the
 patch stack.
+
+### Phase 27 graph-node-traced current-stack profile
+
+Phase 27 re-profiled the current clean llama.cpp n128 serving path with
+`--cuda-graph-trace=node`, using the same source (`f2521ab12`) and GB10 host.
+Artifact: `/home/mudler/bench/phase27_graph_node_serving/20260701_055519`.
+
+The profile run itself reported `decode_agg_tps=675.5`, close to Phase 26's
+n128 paged `673.4`, so the trace is representative for bucket direction. Pre
+gates passed, and the post-gate retry passed after Nsight teardown finished:
+MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
+
+Graph-node-traced macro buckets:
+
+| bucket | time ms | share |
+|--------|---------|-------|
+| GDN | 6706.33 | 33.47% |
+| MoE/FFN-GEMM | 5871.92 | 29.31% |
+| bf16-proj | 2725.07 | 13.60% |
+| layout-copy | 1309.99 | 6.54% |
+| act-quant | 697.75 | 3.48% |
+| MoE-dispatch | 275.99 | 1.38% |
+| FA | 271.03 | 1.35% |
+
+Fine rows keep the same decision shape as Phase 8: `gdn_core` is `29.59%`,
+`mmq_nvfp4` is `28.44%`, while `mm_ids` is `0.61%`, `gather_mmq` is `0.37%`,
+and `argsort_topk` is `0.40%`. Do not reopen metadata/helper-only MoE dispatch
+work on GB10. Any credible source patch must directly reduce GDN, grouped-MMQ,
+or projection work and still pass the md5/op gates.
+
+Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

 ### Phase 10 GDN C32 slab update
diff --git a/docs/superpowers/plans/2026-07-01-graph-node-serving-profile-phase27.md b/docs/superpowers/plans/2026-07-01-graph-node-serving-profile-phase27.md
new file mode 100644
index 000000000..066803fc5
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-graph-node-serving-profile-phase27.md
@@ -0,0 +1,89 @@
+# Graph Node Serving Profile Phase 27 Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Re-profile the current clean llama.cpp serving path with CUDA graph
+node tracing so source decisions are based on the required decode profiling
+method.
+
+**Architecture:** This is a profile-only phase. It does not edit llama.cpp
+source. It runs md5/op gates before and after a graph-node-traced n128 serving
+profile, then records whether the bucket decomposition changes the Phase 8
+helper-dispatch decision.
+
+**Tech Stack:** DGX GB10, llama.cpp CUDA backend, Nsight Systems
+`--cuda-graph-trace=node`, `paged-inference-gates.sh`, LocalAI parity docs.
+
+---
+
+## Checklist
+
+- [x] **Step 1: Confirm the profiling gap**
+  - Phase 8 used an ordinary Nsight serving profile.
+  - Current handoff requires `--cuda-graph-trace=node` for decode/serving
+    profiles because CUDA graph replay can collapse kernel attribution.
+
+- [x] **Step 2: Check DGX preflight before gates**
+  - `docker=0`
+  - `local_ai_worker=0`
+  - `compute=0`
+  - GPU owner file was `FREE`.
+
+- [x] **Step 3: Run pre-profile inference gates**
+  - Artifact: `/home/mudler/bench/phase27_graph_node_serving/20260701_055519/gate_pre`
+  - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
+  - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
+  - `MUL_MAT_ID`: `806/806`
+
+- [x] **Step 4: Fix Nsight session-control syntax**
+  - A first attempt failed because `nsys launch` on Nsight Systems
+    `2025.3.2.474-253236389321v0` rejects `--cpuctxsw`.
+  - A smoke test showed the correct split:
+    `nsys launch --trace=cuda --cuda-graph-trace=node ...` and
+    `nsys start --sample=none --cpuctxsw=none -o OUT`.
+  - Do not pass `--trace`, `--cuda-graph-trace`, and `--cpuctxsw` to both
+    commands on this Nsight version; keep the split shown above.
+
+- [x] **Step 5: Run graph-node-traced n128 serving profile**
+  - Artifact: `/home/mudler/bench/phase27_graph_node_serving/20260701_055519`
+  - Source: `f2521ab12 feat(server): trace speculative batch shapes`
+  - Hardware: `GPU 0: NVIDIA GB10`, driver `580.159.03`, compute `12.1`
+  - Serving shape: `n=128`, `PTOK=128`, `GEN=64`
+  - Client result: `decode_agg_tps=675.5`, `agg_tps=319.9`,
+    `prefill_tps=1671.1`, `TTFT mean=8363.4 ms`
+
+- [x] **Step 6: Run post-profile inference gates**
+  - The immediate post-gate attempt raced with Nsight teardown and reported one
+    compute process even though `nvidia-smi` printed no running processes.
+  - Retried after idle preflight:
+    `/home/mudler/bench/phase27_graph_node_serving/20260701_055519/gate_post_retry`
+  - Retry MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
+  - Retry dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
+  - Retry `MUL_MAT_ID`: `806/806`
+
+- [x] **Step 7: Bucket the graph-node trace**
+  - `buckets.txt` was generated from
+    `llama_graph_node.nsys-rep`.
+  - Macro buckets:
+    - GDN: `6706.33 ms` (`33.47%`)
+    - MoE/FFN-GEMM: `5871.92 ms` (`29.31%`)
+    - bf16-proj: `2725.07 ms` (`13.60%`)
+    - layout-copy: `1309.99 ms` (`6.54%`)
+    - act-quant: `697.75 ms` (`3.48%`)
+    - MoE-dispatch: `275.99 ms` (`1.38%`)
+    - FA: `271.03 ms` (`1.35%`)
+
+- [x] **Step 8: Record decision**
+  - Fine rows confirm the Phase 8 source shortcut rejection:
+    `gdn_core=29.59%`, `mmq_nvfp4=28.44%`, `mm_ids=0.61%`,
+    `gather_mmq=0.37%`, `argsort_topk=0.40%`.
+  - Do not reopen metadata/helper-only MoE dispatch work on GB10.
+  - A credible patch must directly reduce GDN, grouped-MMQ, or projection work
+    while preserving md5/op gates.
+
+## Result
+
+Phase 27 strengthens the profile basis for the current GB10 conclusion. It does
+not find a new low-conflict source shortcut. The profile is representative of
+Phase 26 n128 serving throughput and keeps the inference gates green after a
+post-teardown retry.
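
Reviewer note (not part of the patch): the "about `1.38%`" helper-dispatch figure can be cross-checked from the fine-bucket times recorded above. The sketch below is a rough sanity check under stated assumptions. The function names `implied_total_ms` and `helper_share_pct` are hypothetical, the millisecond values are copied from the Phase 27 tables, and the total kernel time is inferred from the GDN row (`6706.33 ms` at `33.47%`) rather than read from `buckets.txt`.

```python
# Fine-bucket kernel times in ms, copied from the Phase 27 "Fine buckets" list.
HELPER_FINE_MS = {
    "mm_ids": 121.99,
    "gather_mmq": 73.88,
    "argsort_topk": 80.11,
}

# Reference row used to infer the total traced kernel time (assumption:
# the shares in the table are all fractions of the same total).
GDN_MS = 6706.33
GDN_SHARE_PCT = 33.47


def implied_total_ms(bucket_ms: float, share_pct: float) -> float:
    """Infer total kernel time from one bucket's time and its percentage share."""
    return bucket_ms * 100.0 / share_pct


def helper_share_pct(fine_ms: dict, total_ms: float) -> float:
    """Return the combined share (in percent) of the helper/metadata kernels."""
    return sum(fine_ms.values()) * 100.0 / total_ms


total_ms = implied_total_ms(GDN_MS, GDN_SHARE_PCT)
helper_ms = sum(HELPER_FINE_MS.values())
share = helper_share_pct(HELPER_FINE_MS, total_ms)

print(f"helper kernels: {helper_ms:.2f} ms, {share:.2f}% of traced kernel time")
```

The summed helper time lands within rounding of the `MoE-dispatch` macro row (`275.99 ms`), which is consistent with the decision that helper-only dispatch work is too small to fund.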