docs(paged): record current MoE graph profile phase

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 14:56:39 +00:00
parent 26a41fad1a
commit f21b393746
2 changed files with 102 additions and 14 deletions

@@ -12,13 +12,12 @@ with artifact path, gates, benchmark rows, and decision.
- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
- Current tested source: DGX mirror
`14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
- Latest attempt: Phase75.
- Latest decision: subagent codebase audit found no source-funded GB10 GDN
backend change. Phase74 rejects the C=64 inverse scaffold; vLLM's distinct
one-token recurrent decode path is not on the current llama.cpp critical path
because prior profiles showed GDN decode already faster than vLLM and serving
decode host/MoE-sync bound. The next parity evidence should be a datacenter
Blackwell rerun, or a fresh profile proving a different GB10 bottleneck.
- Latest attempt: Phase76.
- Latest decision: the current-stack GB10 graph-node MoE serving profile reopened
a narrow GDN evidence path. GDN was the largest macro bucket (`32.88%`,
`6669.16 ms`) at `n=128`, with `gdn_core` alone at `28.97%`. This does not
justify a backend source change yet, but it funds a Phase77 decode/mixed-serving
A/B proof for vLLM-style recurrent/packed GDN before any patch.
## Current Serving Record
@@ -58,6 +57,71 @@ Decision:
## Attempt Log
### Phase76: Current MoE Serving Graph-Node Profile
- Date: 2026-07-01.
- Artifact:
`/home/mudler/bench/phase76_current_moe_profile/20260701_145116`.
- Setup-hiccup artifacts:
`/home/mudler/bench/phase76_current_moe_profile/20260701_144754` and
`/home/mudler/bench/phase76_current_moe_profile/20260701_144929`.
- Source baseline: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
- Result type: current-stack llama.cpp graph-node serving profile; no source
change.
- Shape: MoE `q36-35b-a3b-nvfp4`, `n=128`, `PTOK=128`, `GEN=64`,
`PARALLEL=128`, `CTX=131072`, production defaults.
- Profiler: `nsys launch --cuda-graph-trace=node`, bucketed with
`/home/mudler/bench/bucket2.py`.
Gates:
| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
|-------|---------|-----------|-----------|--------------|
| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
Serving result under graph-node profiling:
| n | agg_tps | decode_agg_tps | decode_perseq_tps | prefill_tps | ttft_mean_ms | wall_s |
|--:|--------:|---------------:|------------------:|------------:|-------------:|-------:|
| `128` | `204.1` | `320.7` | `2.06` | `1490.1` | `8365.1` | `40.146` |
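
The `agg_tps` value looks consistent with generated tokens divided by wall time. This is a minimal check, assuming each of the `n=128` sequences produced exactly `GEN=64` tokens and that `agg_tps` counts generated tokens only. Neither assumption is stated in the profile itself.

```python
# Assumption (not stated in the profile): agg_tps = n * GEN / wall_s,
# i.e. only generated tokens are counted, and every sequence ran to GEN.
N_SEQS = 128        # n / PARALLEL from the Phase76 shape
GEN_TOKENS = 64     # GEN from the Phase76 shape
WALL_S = 40.146     # wall_s from the serving table


def aggregate_tps(n_seqs: int, gen_tokens: int, wall_s: float) -> float:
    """Generated tokens per second over the whole run."""
    return n_seqs * gen_tokens / wall_s


derived_agg_tps = aggregate_tps(N_SEQS, GEN_TOKENS, WALL_S)
print(f"derived agg_tps = {derived_agg_tps:.1f}")
```

Under those assumptions the derived figure rounds to the reported `204.1`, which suggests the serving row is internally consistent.
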
Macro buckets:
| bucket | time ms | share | instances |
|--------|--------:|------:|----------:|
| GDN | `6669.16` | `32.88%` | `25980` |
| MoE/FFN-GEMM | `6264.88` | `30.88%` | `54406` |
| bf16/fp8-proj | `2772.38` | `13.67%` | `53880` |
| layout-copy | `1265.44` | `6.24%` | `81280` |
| ew-mul(weight/norm/GDN) | `734.61` | `3.62%` | `52464` |
| act-quant | `678.95` | `3.35%` | `37526` |
| FA | `264.50` | `1.30%` | `3660` |
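
The macro table lists seven buckets whose shares do not sum to 100%. The sketch below estimates the implied total and the unlisted remainder, assuming every share uses the same denominator (total bucketed GPU time). That denominator is an assumption, since `bucket2.py` is not shown here.

```python
# Macro buckets copied from the Phase76 table: name -> (time_ms, share_pct).
MACRO_BUCKETS = {
    "GDN": (6669.16, 32.88),
    "MoE/FFN-GEMM": (6264.88, 30.88),
    "bf16/fp8-proj": (2772.38, 13.67),
    "layout-copy": (1265.44, 6.24),
    "ew-mul(weight/norm/GDN)": (734.61, 3.62),
    "act-quant": (678.95, 3.35),
    "FA": (264.50, 1.30),
}

listed_ms = sum(ms for ms, _ in MACRO_BUCKETS.values())
listed_share = sum(pct for _, pct in MACRO_BUCKETS.values())

# Assumption: all shares are relative to one common total.
implied_total_ms = listed_ms / (listed_share / 100.0)
unlisted_share = 100.0 - listed_share

print(f"listed: {listed_ms:.2f} ms ({listed_share:.2f}%)")
print(f"implied total: {implied_total_ms:.0f} ms, unlisted: {unlisted_share:.2f}%")
```

Roughly 8% of the bucketed time falls outside the listed rows, so the GDN and MoE/FFN-GEMM shares should be read against about 20.3 s of kernel time rather than the 40.146 s wall time.
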
Fine buckets:
| bucket | macro | time ms | share | instances |
|--------|-------|--------:|------:|----------:|
| `gdn_core` | GDN | `5876.94` | `28.97%` | `4680` |
| `gdn_conv` | GDN | `454.03` | `2.24%` | `7260` |
| `gdn_gather` | GDN | `237.87` | `1.17%` | `4680` |
| `gdn_l2norm` | GDN | `100.32` | `0.49%` | `9360` |
| `mmq_nvfp4` | MoE/FFN-GEMM | `6055.03` | `29.85%` | `34162` |
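
The four GDN fine buckets can be cross-checked against the GDN macro row. This sketch uses only the table values; treating `instances` as a per-kernel count that can be averaged over is an assumption.

```python
# GDN macro row and fine rows from the Phase76 tables: (time_ms, instances).
GDN_MACRO = (6669.16, 25980)
GDN_FINE = {
    "gdn_core": (5876.94, 4680),
    "gdn_conv": (454.03, 7260),
    "gdn_gather": (237.87, 4680),
    "gdn_l2norm": (100.32, 9360),
}

fine_ms = sum(ms for ms, _ in GDN_FINE.values())
fine_instances = sum(count for _, count in GDN_FINE.values())

core_ms, core_instances = GDN_FINE["gdn_core"]
core_share_of_gdn = core_ms / GDN_MACRO[0]
# Assumption: instances are comparable units, so a mean per instance is meaningful.
core_ms_per_instance = core_ms / core_instances

print(f"fine sum: {fine_ms:.2f} ms over {fine_instances} instances")
print(f"gdn_core: {core_share_of_gdn:.1%} of GDN, {core_ms_per_instance:.3f} ms/instance")
```

The fine rows account for the whole GDN macro bucket in both time and instance count, and `gdn_core` carries about 88% of it, which is why the decision below targets `gdn_core` specifically.
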
Decision:
- Phase76 contradicts the Phase75 assumption that GDN decode is not on the
current critical path. In the current-stack graph-node serving profile, GDN is
the largest GPU-kernel macro bucket and `gdn_core` alone is nearly `29%`.
- Do not patch `gated_delta_net.cu` yet. This profile is llama-only and
graph-node tracing depresses absolute throughput, so it is a source-funding
signal, not a source patch gate.
- Fund Phase77 as a narrow proof before backend edits:
compare current `gdn_core` against a vLLM-style direct recurrent/packed decode
PoC or an in-backend default-off A/B, with pre/post md5 and op gates, and
require a material reduction in the Phase76 `gdn_core` bucket without
regressing serving throughput or canonical md5.
### Phase75: Post-PoC GDN/VLLM Audit
- Date: 2026-07-01.

@@ -1272,10 +1272,34 @@ Phase75 follow-up audit:
DSL and should not be treated as a portable GB10 patch base until the local
toolchain proves support.
Current next gate: run the Phase72 same-session serving shape on B200/B100/GB200
or equivalent datacenter Blackwell hardware. Require `hardware_class` to be
`datacenter_blackwell`, pre/post md5 and op gates to be green, and graph-node
decode profiles for both llama.cpp and vLLM. If the rerun does not materially
raise the GB10 Phase72 decode ratios (`0.7561`, `0.7158`, `0.6935`), keep GB10
GDN source work stopped and classify the residual gap as engine/kernel
architecture rather than GB10 memory bandwidth.
Phase76 current-stack GB10 graph-node profile:
- Artifact:
`/home/mudler/bench/phase76_current_moe_profile/20260701_145116`.
- Shape: MoE `q36-35b-a3b-nvfp4`, `n=128`, `PTOK=128`, `GEN=64`,
`PARALLEL=128`, `CTX=131072`, production defaults.
- Pre/post gates were green: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense
md5 `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT 1146/1146`,
`MUL_MAT_ID 806/806`.
- Serving under graph-node profiling: aggregate `204.1 t/s`, decode aggregate
`320.7 t/s`, prefill `1490.1 t/s`, TTFT mean `8365.1 ms`, wall `40.146 s`.
- Bucket result: GDN was the largest macro bucket, `6669.16 ms` (`32.88%`),
ahead of MoE/FFN-GEMM `6264.88 ms` (`30.88%`) and BF16 projections
`2772.38 ms` (`13.67%`). `gdn_core` alone was `5876.94 ms` (`28.97%`).
This supersedes the Phase75 "datacenter only unless fresh profile" wording:
Phase76 is that fresh profile. It does **not** justify an immediate backend
patch because it is llama-only and graph-node tracing depresses absolute
throughput, but it does fund one narrow GB10 follow-up before waiting for B200:
prove whether vLLM's direct recurrent/packed decode idea can reduce the current
`gdn_core` bucket.
Current next gate:
1. Keep the B200/B100/GB200 Phase72 same-session rerun as the hardware-pivot
gate when datacenter Blackwell is available.
2. In parallel on GB10, run a Phase77 GDN decode proof with pre/post md5 and op
gates. Accept only if it materially reduces the Phase76 `gdn_core` bucket and
does not regress serving throughput or canonical output md5.
3. Do not merge or default-on any `gated_delta_net.cu` change from this evidence
alone; Phase76 is a profile gate, not a source patch gate.
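
The Phase77 acceptance rule in item 2 could be encoded roughly as follows. This is a sketch only: `Phase77Run`, `phase77_accepts`, and both thresholds are hypothetical, and the 10% figure is a placeholder because the text does not quantify "materially". It also assumes the candidate run is measured under the same graph-node profiling conditions as Phase76.

```python
from dataclasses import dataclass
from typing import Tuple

# Baselines copied from the Phase76 record.
CANONICAL_MOE_MD5 = "8cb0ce23777bf55f92f63d0292c756b0"
CANONICAL_DENSE_MD5 = "5951a5b4d624ce891e22ab5fca9bc439"
BASELINE_GDN_CORE_MS = 5876.94
BASELINE_DECODE_AGG_TPS = 320.7


@dataclass(frozen=True)
class Phase77Run:
    """Hypothetical record of one Phase77 candidate run."""
    moe_md5_pre: str
    moe_md5_post: str
    dense_md5_pre: str
    dense_md5_post: str
    mul_mat: Tuple[int, int]      # (passed, total)
    mul_mat_id: Tuple[int, int]   # (passed, total)
    gdn_core_ms: float
    decode_agg_tps: float


def phase77_accepts(run: Phase77Run,
                    min_reduction: float = 0.10,   # placeholder threshold
                    tps_tolerance: float = 0.0) -> bool:
    """Return True only if the md5, op, gdn_core, and throughput gates all pass."""
    md5_ok = (
        run.moe_md5_pre == run.moe_md5_post == CANONICAL_MOE_MD5
        and run.dense_md5_pre == run.dense_md5_post == CANONICAL_DENSE_MD5
    )
    ops_ok = all(passed == total for passed, total in (run.mul_mat, run.mul_mat_id))
    reduction = 1.0 - run.gdn_core_ms / BASELINE_GDN_CORE_MS
    tps_ok = run.decode_agg_tps >= BASELINE_DECODE_AGG_TPS * (1.0 - tps_tolerance)
    return md5_ok and ops_ok and reduction >= min_reduction and tps_ok


candidate = Phase77Run(
    CANONICAL_MOE_MD5, CANONICAL_MOE_MD5,
    CANONICAL_DENSE_MD5, CANONICAL_DENSE_MD5,
    (1146, 1146), (806, 806),
    gdn_core_ms=4000.0,       # illustrative value, not a measured result
    decode_agg_tps=330.0,     # illustrative value, not a measured result
)
print(phase77_accepts(candidate))
```

Even a passing result here would only satisfy item 2. Item 3 still applies, so a pass is evidence for further work rather than permission to merge or default-on a `gated_delta_net.cu` change.
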