docs(paged): record MoE decode-only profile phase

Assisted-by: Codex:gpt-5
2026-07-02 20:37:03 -04:00 · 2026-07-01 15:05:45 +00:00
parent f21b393746
commit a9454b45c8
2 changed files with 88 additions and 6 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md
@@ -12,12 +12,12 @@ with artifact path, gates, benchmark rows, and decision.
 - Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
 - Current tested source: DGX mirror
  `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
- Latest attempt: Phase76.
- Latest decision: current-stack GB10 graph-node MoE serving profile reopened a
-  narrow GDN evidence path. GDN was the largest macro bucket (`32.88%`,
-  `6669.16 ms`) at `n=128`, with `gdn_core` alone `28.97%`. This does not
-  justify backend source yet, but it funds a Phase77 decode/mixed-serving A/B
-  proof for vLLM-style recurrent/packed GDN before any patch.
+- Latest attempt: Phase77.
+- Latest decision: the decode-only GB10 graph-node profile confirms that GDN
+  recurrence is a genuine decode cost. In an isolated `N=128` decode window,
+  GDN was `41.20%` of GPU kernel time and `gdn_core` alone `38.95%`, slightly
+  above `mmq_nvfp4` (`38.26%`). This funds a default-off GDN decode A/B or PoC;
+  md5/op gates and a bucket reduction are required before merge or default-on.

 ## Current Serving Record

@@ -57,6 +57,65 @@ Decision:

 ## Attempt Log

+### Phase77: MoE Decode-Only Graph-Node Profile
+
+- Date: 2026-07-01.
+- Artifact:
+  `/home/mudler/bench/phase77_moe_decode_only_profile/20260701_150134`.
+- Setup-hiccup artifact:
+  `/home/mudler/bench/phase77_moe_decode_only_profile/20260701_145815`.
+- Source baseline: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
+- Result type: current-stack llama.cpp decode-only graph-node profile; no
+  source change.
+- Shape: MoE `q36-35b-a3b-nvfp4`, `N=128`, long-running `/completion`
+  requests, `N_PREDICT=2048`, capture after active decode.
+- Capture window: active slots `128`; median decoded depth `67` at start and
+  `89` mid-capture; `CAPTURE_SECONDS=4`.
+- Profiler: `nsys launch --cuda-graph-trace=node`, bucketed with
+  `/home/mudler/bench/bucket2.py`.
+
+Gates:
+
+| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
+|-------|---------|-----------|-----------|--------------|
+| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+
+Macro buckets:
+
+| bucket | time ms | share | instances |
+|--------|--------:|------:|----------:|
+| GDN | `1489.71` | `41.20%` | `3600` |
+| MoE/FFN-GEMM | `1400.77` | `38.74%` | `7220` |
+| bf16/fp8-proj | `352.90` | `9.76%` | `7400` |
+| layout-copy | `69.85` | `1.93%` | `10400` |
+| act-quant | `67.63` | `1.87%` | `4820` |
+| FA | `36.74` | `1.02%` | `600` |
+
+Fine buckets:
+
+| bucket | macro | time ms | share | instances |
+|--------|-------|--------:|------:|----------:|
+| `gdn_core` | GDN | `1408.33` | `38.95%` | `600` |
+| `mmq_nvfp4` | MoE/FFN-GEMM | `1383.50` | `38.26%` | `4820` |
+| `gdn_conv` | GDN | `71.76` | `1.98%` | `1200` |
+| `gdn_l2norm` | GDN | `8.81` | `0.24%` | `1200` |
+| `gdn_gather` | GDN | `0.80` | `0.02%` | `600` |
+
+Decision:
+
+- Phase77 confirms that Phase76's GDN bucket is not merely prompt/prefill
+  contamination. In an isolated decode window, `gdn_core` is the largest fine
+  bucket (`38.95%`), slightly ahead of `mmq_nvfp4` (`38.26%`).
+- This supersedes the Phase75 no-GB10-GDN-source stance for decode only. The
+  funded source path is not the C=64 prefill inverse work; it is a narrow,
+  default-off GDN decode A/B or standalone PoC based on the direct
+  recurrent/packed decode structure found in vLLM.
+- Acceptance gate for the next source attempt:
+  reduce the Phase77 `gdn_core` bucket materially, keep pre/post md5 and
+  `MUL_MAT`/`MUL_MAT_ID` green, and show no serving/decode throughput
+  regression under the same decode-only capture shape.
+
 ### Phase76: Current MoE Serving Graph-Node Profile

 - Date: 2026-07-01.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -1303,3 +1303,26 @@ Current next gate:
   does not regress serving throughput or canonical output md5.
 3. Do not merge or default-on any `gated_delta_net.cu` change from this evidence
   alone; Phase76 is a profile gate, not a source patch gate.
+
+Phase77 decode-only profile result:
+
+- Artifact:
+  `/home/mudler/bench/phase77_moe_decode_only_profile/20260701_150134`.
+- Shape: MoE `q36-35b-a3b-nvfp4`, `N=128`, long-running `/completion`
+  requests, `N_PREDICT=2048`, capture after active decode.
+- Capture window: active slots `128`; median decoded depth `67` at start and
+  `89` mid-capture.
+- Pre/post gates were green: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense
+  md5 `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT 1146/1146`,
+  `MUL_MAT_ID 806/806`.
+- Bucket result: GDN `1489.71 ms` (`41.20%`) and MoE/FFN-GEMM `1400.77 ms`
+  (`38.74%`). Fine bucket `gdn_core` was `1408.33 ms` (`38.95%`), slightly
+  larger than `mmq_nvfp4` at `1383.50 ms` (`38.26%`).
+
+Phase77 supersedes the Phase75 "no GB10 GDN source work" stance for decode
+only. Do **not** reopen the failed C=64 prefill inverse scaffold. The funded
+GB10 source path is now a narrow, default-off GDN decode A/B or standalone PoC
+based on vLLM's direct recurrent/packed decode structure. The next patch must
+prove a material reduction in the Phase77 `gdn_core` bucket, keep canonical md5
+and op gates green, and avoid serving/decode throughput regression under the
+same decode-only capture shape before it can be considered for merge or default.
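
Reviewer note: the acceptance gate described in both files could be checked mechanically. The following is a minimal sketch, not part of this commit. The md5 values, op counts, and the `1408.33 ms` baseline come from the Phase77 record; the names `GateSnapshot`, `gates_green`, `gdn_core_reduction`, and `accept` are hypothetical, and the `min_reduction=0.10` default is an assumption, since the docs only require a "material" reduction without stating a number. Throughput is taken as any tokens-per-second figure measured under the same decode-only capture shape.

```python
from __future__ import annotations

from dataclasses import dataclass

# Canonical values recorded in the Phase77 gate table.
CANONICAL_MOE_MD5 = "8cb0ce23777bf55f92f63d0292c756b0"
CANONICAL_DENSE_MD5 = "5951a5b4d624ce891e22ab5fca9bc439"
PHASE77_GDN_CORE_MS = 1408.33  # baseline gdn_core bucket, decode-only capture


@dataclass(frozen=True)
class GateSnapshot:
    """One row of the gate table (pre or post). Hypothetical container."""
    moe_md5: str
    dense_md5: str
    mul_mat: tuple[int, int]      # (passed, total), e.g. (1146, 1146)
    mul_mat_id: tuple[int, int]   # (passed, total), e.g. (806, 806)


def gates_green(s: GateSnapshot) -> bool:
    """True when both md5s are canonical and every op test passed."""
    return (
        s.moe_md5 == CANONICAL_MOE_MD5
        and s.dense_md5 == CANONICAL_DENSE_MD5
        and s.mul_mat[0] == s.mul_mat[1]
        and s.mul_mat_id[0] == s.mul_mat_id[1]
    )


def gdn_core_reduction(candidate_ms: float) -> float:
    """Fractional reduction of gdn_core time versus the Phase77 baseline."""
    return 1.0 - candidate_ms / PHASE77_GDN_CORE_MS


def accept(
    pre: GateSnapshot,
    post: GateSnapshot,
    candidate_gdn_core_ms: float,
    baseline_tps: float,
    candidate_tps: float,
    min_reduction: float = 0.10,  # assumption: "material" is not quantified
) -> bool:
    """Combine the three conditions: green gates, bucket reduction, no regression."""
    return (
        gates_green(pre)
        and gates_green(post)
        and gdn_core_reduction(candidate_gdn_core_ms) >= min_reduction
        and candidate_tps >= baseline_tps
    )
```

A candidate patch would pass only if all three conditions hold at once; a faster `gdn_core` with a changed md5, or with lower decode throughput, would still be rejected.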