docs(paged): record MoE decode-only profile phase

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 15:05:45 +00:00
parent f21b393746
commit a9454b45c8
2 changed files with 88 additions and 6 deletions

@@ -12,12 +12,12 @@ with artifact path, gates, benchmark rows, and decision.
- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
- Current tested source: DGX mirror
`14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
- Latest attempt: Phase76.
- Latest decision: current-stack GB10 graph-node MoE serving profile reopened a
narrow GDN evidence path. GDN was the largest macro bucket (`32.88%`,
`6669.16 ms`) at `n=128`, with `gdn_core` alone `28.97%`. This does not
justify backend source yet, but it funds a Phase77 decode/mixed-serving A/B
proof for vLLM-style recurrent/packed GDN before any patch.
- Latest attempt: Phase77.
- Latest decision: decode-only GB10 graph-node profile confirms GDN recurrence
is a real current decode bucket. In an isolated n=128 decode window, GDN was
`41.20%` of GPU kernel time and `gdn_core` alone was `38.95%`, slightly above
`mmq_nvfp4` (`38.26%`). This funds a default-off GDN decode A/B/PoC, with
md5/op gates and bucket reduction required before any merge/default change.
## Current Serving Record
@@ -57,6 +57,65 @@ Decision:
## Attempt Log
### Phase77: MoE Decode-Only Graph-Node Profile
- Date: 2026-07-01.
- Artifact:
`/home/mudler/bench/phase77_moe_decode_only_profile/20260701_150134`.
- Setup-hiccup artifact:
`/home/mudler/bench/phase77_moe_decode_only_profile/20260701_145815`.
- Source baseline: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
- Result type: current-stack llama.cpp decode-only graph-node profile; no
source change.
- Shape: MoE `q36-35b-a3b-nvfp4`, `N=128`, long-running `/completion`
requests, `N_PREDICT=2048`, capture after active decode.
- Capture window: active slots `128`; median decoded depth `67` at start and
`89` mid-capture; `CAPTURE_SECONDS=4`.
- Profiler: `nsys launch --cuda-graph-trace=node`, bucketed with
`/home/mudler/bench/bucket2.py`.
Gates:
| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
|-------|---------|-----------|-----------|--------------|
| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
Macro buckets:
| bucket | time ms | share | instances |
|--------|--------:|------:|----------:|
| GDN | `1489.71` | `41.20%` | `3600` |
| MoE/FFN-GEMM | `1400.77` | `38.74%` | `7220` |
| bf16/fp8-proj | `352.90` | `9.76%` | `7400` |
| layout-copy | `69.85` | `1.93%` | `10400` |
| act-quant | `67.63` | `1.87%` | `4820` |
| FA | `36.74` | `1.02%` | `600` |
Fine buckets:
| bucket | macro | time ms | share | instances |
|--------|-------|--------:|------:|----------:|
| `gdn_core` | GDN | `1408.33` | `38.95%` | `600` |
| `mmq_nvfp4` | MoE/FFN-GEMM | `1383.50` | `38.26%` | `4820` |
| `gdn_conv` | GDN | `71.76` | `1.98%` | `1200` |
| `gdn_l2norm` | GDN | `8.81` | `0.24%` | `1200` |
| `gdn_gather` | GDN | `0.80` | `0.02%` | `600` |
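
The two tables above can be cross-checked with a few lines of arithmetic. The sketch below is not part of `bucket2.py`; it only re-derives figures from the rows recorded here, and the variable names are illustrative. It checks that the four fine GDN buckets add up to the macro GDN bucket, and that the GDN and MoE/FFN-GEMM rows imply the same total GPU kernel time for the capture window.

```python
# Figures copied from the Phase77 macro and fine bucket tables.
# Variable names are illustrative and not taken from bucket2.py.
macro_ms = {"GDN": 1489.71, "MoE/FFN-GEMM": 1400.77}
macro_share_pct = {"GDN": 41.20, "MoE/FFN-GEMM": 38.74}
gdn_fine_ms = {
    "gdn_core": 1408.33,
    "gdn_conv": 71.76,
    "gdn_l2norm": 8.81,
    "gdn_gather": 0.80,
}

# The fine GDN buckets should sum to the macro GDN bucket (up to rounding).
gdn_fine_total_ms = sum(gdn_fine_ms.values())

# Each macro row implies a total kernel time: time / (share / 100).
implied_total_ms = {
    name: macro_ms[name] / (macro_share_pct[name] / 100.0) for name in macro_ms
}

# Re-derive the gdn_core share from the implied total.
gdn_core_share_pct = 100.0 * gdn_fine_ms["gdn_core"] / implied_total_ms["GDN"]

print(f"fine GDN sum: {gdn_fine_total_ms:.2f} ms (macro {macro_ms['GDN']:.2f} ms)")
for name, total in implied_total_ms.items():
    print(f"implied total from {name}: {total:.1f} ms")
print(f"gdn_core share: {gdn_core_share_pct:.2f}%")
```

Both rows imply a total of roughly 3.6 s of GPU kernel time, so the shares and absolute times in the tables are mutually consistent.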
Decision:
- Phase77 confirms Phase76's GDN bucket is not only prompt/prefill
contamination. In an isolated decode window, `gdn_core` is the largest fine
bucket and is slightly larger than `mmq_nvfp4`.
- This supersedes the Phase75 no-GB10-GDN-source stance. The source-funded path
is no longer C=64 prefill inverse work; it is a narrow default-off GDN decode
A/B or standalone PoC based on the direct recurrent/packed decode structure
found in vLLM.
- Acceptance gate for the next source attempt:
reduce the Phase77 `gdn_core` bucket materially, keep pre/post md5 and
`MUL_MAT`/`MUL_MAT_ID` green, and show no serving/decode throughput
regression under the same decode-only capture shape.
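
The acceptance gate above could be expressed as a small check, sketched here under stated assumptions. The record does not quantify "materially" and does not list a throughput figure, so `min_core_reduction` and `decode_tok_s` are hypothetical placeholders, as are the `Capture` type and the function names. Only the md5 values, op counts, and the `gdn_core` time come from the Phase77 tables.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Capture:
    """One decode-only capture; a hypothetical container for this sketch."""
    moe_md5: str
    dense_md5: str
    mul_mat: tuple       # (passed, total)
    mul_mat_id: tuple    # (passed, total)
    gdn_core_ms: float
    decode_tok_s: float  # assumption: throughput is not recorded in the tables


def phase77_baseline(decode_tok_s):
    """Phase77 figures from the record; throughput must be supplied."""
    return Capture(
        moe_md5="8cb0ce23777bf55f92f63d0292c756b0",
        dense_md5="5951a5b4d624ce891e22ab5fca9bc439",
        mul_mat=(1146, 1146),
        mul_mat_id=(806, 806),
        gdn_core_ms=1408.33,
        decode_tok_s=decode_tok_s,
    )


def gate_failures(baseline, candidate, min_core_reduction=0.10):
    """Return the list of failed gates; an empty list means the gate passes.

    min_core_reduction is a hypothetical stand-in for "materially".
    """
    failures = []
    if candidate.moe_md5 != baseline.moe_md5:
        failures.append("MoE md5 changed")
    if candidate.dense_md5 != baseline.dense_md5:
        failures.append("dense md5 changed")
    for name, (passed, total) in (("MUL_MAT", candidate.mul_mat),
                                  ("MUL_MAT_ID", candidate.mul_mat_id)):
        if passed != total:
            failures.append(f"{name} {passed}/{total}")
    reduction = 1.0 - candidate.gdn_core_ms / baseline.gdn_core_ms
    if reduction < min_core_reduction:
        failures.append(f"gdn_core reduced by {reduction:.1%} only")
    if candidate.decode_tok_s < baseline.decode_tok_s:
        failures.append("decode throughput regressed")
    return failures


# Placeholder throughput and a hypothetical candidate, for illustration only.
base = phase77_baseline(decode_tok_s=100.0)
unchanged = gate_failures(base, base)
improved = gate_failures(base, replace(base, gdn_core_ms=1000.0, decode_tok_s=105.0))
```

With these placeholder inputs, an unchanged build fails only the `gdn_core` reduction check, while the hypothetical candidate passes every gate.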
### Phase76: Current MoE Serving Graph-Node Profile
- Date: 2026-07-01.

@@ -1303,3 +1303,26 @@ Current next gate:
does not regress serving throughput or canonical output md5.
3. Do not merge or default-on any `gated_delta_net.cu` change from this evidence
alone; Phase76 is a profile gate, not a source patch gate.
Phase77 decode-only profile result:
- Artifact:
`/home/mudler/bench/phase77_moe_decode_only_profile/20260701_150134`.
- Shape: MoE `q36-35b-a3b-nvfp4`, `N=128`, long-running `/completion`
requests, `N_PREDICT=2048`, capture after active decode.
- Capture window: active slots `128`; median decoded depth `67` at start and
`89` mid-capture.
- Pre/post gates were green: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense
md5 `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT 1146/1146`,
`MUL_MAT_ID 806/806`.
- Bucket result: GDN `1489.71 ms` (`41.20%`) and MoE/FFN-GEMM `1400.77 ms`
(`38.74%`). Fine bucket `gdn_core` was `1408.33 ms` (`38.95%`), slightly
larger than `mmq_nvfp4` at `1383.50 ms` (`38.26%`).
Phase77 supersedes the Phase75 "no GB10 GDN source work" stance for decode
only. Do **not** reopen the failed C=64 prefill inverse scaffold. The funded
GB10 source path is now a narrow, default-off GDN decode A/B or standalone PoC
based on vLLM's direct recurrent/packed decode structure. The next patch must
prove a material reduction in the Phase77 `gdn_core` bucket, keep canonical md5
and op gates green, and avoid serving/decode throughput regression under the
same decode-only capture shape before it can be considered for merge or default.