docs(paged): record current MoE graph profile phase

Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 14:56:39 +00:00
parent 26a41fad1a
commit f21b393746
2 changed files with 102 additions and 14 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md
@@ -12,13 +12,12 @@ with artifact path, gates, benchmark rows, and decision.
 - Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
 - Current tested source: DGX mirror
  `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
- Latest attempt: Phase75.
- Latest decision: subagent codebase audit found no source-funded GB10 GDN
-  backend change. Phase74 rejects the C=64 inverse scaffold; vLLM's distinct
-  one-token recurrent decode path is not on the current llama.cpp critical path
-  because prior profiles showed GDN decode already faster than vLLM and serving
-  decode host/MoE-sync bound. The next parity evidence should be a datacenter
-  Blackwell rerun, or a fresh profile proving a different GB10 bottleneck.
+- Latest attempt: Phase76.
+- Latest decision: current-stack GB10 graph-node MoE serving profile reopened a
+  narrow GDN evidence path. GDN was the largest macro bucket (`32.88%`,
+  `6669.16 ms`) at `n=128`, with `gdn_core` alone `28.97%`. This does not yet
+  justify a backend source change, but it funds a Phase77 decode/mixed-serving
+  A/B proof for vLLM-style recurrent/packed GDN before any patch.

 ## Current Serving Record

@@ -58,6 +57,71 @@ Decision:

 ## Attempt Log

+### Phase76: Current MoE Serving Graph-Node Profile
+
+- Date: 2026-07-01.
+- Artifact:
+  `/home/mudler/bench/phase76_current_moe_profile/20260701_145116`.
+- Setup-hiccup artifacts:
+  `/home/mudler/bench/phase76_current_moe_profile/20260701_144754` and
+  `/home/mudler/bench/phase76_current_moe_profile/20260701_144929`.
+- Source baseline: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
+- Result type: current-stack llama.cpp graph-node serving profile; no source
+  change.
+- Shape: MoE `q36-35b-a3b-nvfp4`, `n=128`, `PTOK=128`, `GEN=64`,
+  `PARALLEL=128`, `CTX=131072`, production defaults.
+- Profiler: `nsys launch --cuda-graph-trace=node`, bucketed with
+  `/home/mudler/bench/bucket2.py`.
+
+Gates:
+
+| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
+|-------|---------|-----------|-----------|--------------|
+| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+
+Serving result under graph-node profiling:
+
+| n | agg_tps | decode_agg_tps | decode_perseq_tps | prefill_tps | ttft_mean_ms | wall_s |
+|--:|--------:|---------------:|------------------:|------------:|-------------:|-------:|
+| `128` | `204.1` | `320.7` | `2.06` | `1490.1` | `8365.1` | `40.146` |
+
+Macro buckets:
+
+| bucket | time ms | share | instances |
+|--------|--------:|------:|----------:|
+| GDN | `6669.16` | `32.88%` | `25980` |
+| MoE/FFN-GEMM | `6264.88` | `30.88%` | `54406` |
+| bf16/fp8-proj | `2772.38` | `13.67%` | `53880` |
+| layout-copy | `1265.44` | `6.24%` | `81280` |
+| ew-mul(weight/norm/GDN) | `734.61` | `3.62%` | `52464` |
+| act-quant | `678.95` | `3.35%` | `37526` |
+| FA | `264.50` | `1.30%` | `3660` |
+
+Fine buckets:
+
+| bucket | macro | time ms | share | instances |
+|--------|-------|--------:|------:|----------:|
+| `gdn_core` | GDN | `5876.94` | `28.97%` | `4680` |
+| `gdn_conv` | GDN | `454.03` | `2.24%` | `7260` |
+| `gdn_gather` | GDN | `237.87` | `1.17%` | `4680` |
+| `gdn_l2norm` | GDN | `100.32` | `0.49%` | `9360` |
+| `mmq_nvfp4` | MoE/FFN-GEMM | `6055.03` | `29.85%` | `34162` |
+
+Decision:
+
+- Phase76 contradicts the Phase75 assumption that GDN decode is not on the
+  current critical path. In the current-stack graph-node profile, GDN is the
+  largest GPU-kernel macro bucket and `gdn_core` alone is nearly `29%`.
+- Do not patch `gated_delta_net.cu` yet. This profile is llama-only and
+  graph-node tracing depresses absolute throughput, so it is a source-funding
+  signal, not a source patch gate.
+- Fund Phase77 as a narrow proof before backend edits:
+  compare current `gdn_core` against a vLLM-style direct recurrent/packed decode
+  PoC or an in-backend default-off A/B, with pre/post md5 and op gates, and
+  require a material reduction in the Phase76 `gdn_core` bucket without
+  regressing serving throughput or canonical md5.
+
 ### Phase75: Post-PoC GDN/VLLM Audit

 - Date: 2026-07-01.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -1272,10 +1272,34 @@ Phase75 follow-up audit:
  DSL and should not be treated as a portable GB10 patch base until the local
  toolchain proves support.

-Current next gate: run the Phase72 same-session serving shape on B200/B100/GB200
-or equivalent datacenter Blackwell hardware. Require `hardware_class` to be
-`datacenter_blackwell`, pre/post md5 and op gates to be green, and graph-node
-decode profiles for both llama.cpp and vLLM. If the rerun does not materially
-raise the GB10 Phase72 decode ratios (`0.7561`, `0.7158`, `0.6935`), keep GB10
-GDN source work stopped and classify the residual gap as engine/kernel
-architecture rather than GB10 memory bandwidth.
+Phase76 current-stack GB10 graph-node profile:
+
+- Artifact:
+  `/home/mudler/bench/phase76_current_moe_profile/20260701_145116`.
+- Shape: MoE `q36-35b-a3b-nvfp4`, `n=128`, `PTOK=128`, `GEN=64`,
+  `PARALLEL=128`, `CTX=131072`, production defaults.
+- Pre/post gates were green: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense
+  md5 `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT 1146/1146`,
+  `MUL_MAT_ID 806/806`.
+- Serving under graph-node profiling: aggregate `204.1 t/s`, decode aggregate
+  `320.7 t/s`, prefill `1490.1 t/s`, TTFT mean `8365.1 ms`, wall `40.146 s`.
+- Bucket result: GDN was the largest macro bucket, `6669.16 ms` (`32.88%`),
+  ahead of MoE/FFN-GEMM `6264.88 ms` (`30.88%`) and `bf16/fp8-proj`
+  `2772.38 ms` (`13.67%`). `gdn_core` alone was `5876.94 ms` (`28.97%`).
+
+This supersedes the Phase75 "datacenter only unless fresh profile" wording:
+Phase76 is that fresh profile. It does **not** justify an immediate backend
+patch because it is llama-only and graph-node tracing depresses absolute
+throughput, but it does fund one narrow GB10 follow-up before waiting for B200:
+prove whether vLLM's direct recurrent/packed decode idea can reduce the current
+`gdn_core` bucket.
+
+Current next gate:
+
+1. Keep the B200/B100/GB200 Phase72 same-session rerun as the hardware-pivot
+   gate when datacenter Blackwell is available.
+2. In parallel on GB10, run a Phase77 GDN decode proof with pre/post md5 and op
+   gates. Accept only if it materially reduces the Phase76 `gdn_core` bucket and
+   does not regress serving throughput or canonical output md5.
+3. Do not merge or default-on any `gated_delta_net.cu` change from this evidence
+   alone; Phase76 is a profile gate, not a source patch gate.