mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
docs(paged): record current MoE graph profile phase
Assisted-by: Codex:gpt-5
@@ -12,13 +12,12 @@ with artifact path, gates, benchmark rows, and decision.
- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
- Current tested source: DGX mirror
  `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
- Latest attempt: Phase75.
- Latest decision: subagent codebase audit found no source-funded GB10 GDN
  backend change. Phase74 rejects the C=64 inverse scaffold; vLLM's distinct
  one-token recurrent decode path is not on the current llama.cpp critical path
  because prior profiles showed GDN decode already faster than vLLM and serving
  decode host/MoE-sync bound. The next parity evidence should be a datacenter
  Blackwell rerun, or a fresh profile proving a different GB10 bottleneck.
- Latest attempt: Phase76.
- Latest decision: current-stack GB10 graph-node MoE serving profile reopened a
  narrow GDN evidence path. GDN was the largest macro bucket (`32.88%`,
  `6669.16 ms`) at `n=128`, with `gdn_core` alone `28.97%`. This does not
  justify backend source yet, but it funds a Phase77 decode/mixed-serving A/B
  proof for vLLM-style recurrent/packed GDN before any patch.

## Current Serving Record

@@ -58,6 +57,71 @@ Decision:

## Attempt Log

### Phase76: Current MoE Serving Graph-Node Profile

- Date: 2026-07-01.
- Artifact:
  `/home/mudler/bench/phase76_current_moe_profile/20260701_145116`.
- Setup-hiccup artifacts:
  `/home/mudler/bench/phase76_current_moe_profile/20260701_144754` and
  `/home/mudler/bench/phase76_current_moe_profile/20260701_144929`.
- Source baseline: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
- Result type: current-stack llama.cpp graph-node serving profile; no source
  change.
- Shape: MoE `q36-35b-a3b-nvfp4`, `n=128`, `PTOK=128`, `GEN=64`,
  `PARALLEL=128`, `CTX=131072`, production defaults.
- Profiler: `nsys launch --cuda-graph-trace=node`, bucketed with
  `/home/mudler/bench/bucket2.py`.

Gates:

| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
|-------|---------|-----------|-----------|--------------|
| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |

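The gate table reads as an invariance check: digests and op-test counts must be identical before and after the run, and the dense digest must match the canonical one. A minimal sketch of that comparison follows, with values copied from the table. The `Gate` record and `gates_green` helper are hypothetical names for illustration, not part of the benchmark tooling, and the exact definition of "green" is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    """One row of the gate table (hypothetical record type)."""
    moe_md5: str
    dense_md5: str
    mul_mat: tuple[int, int]      # (passed, total)
    mul_mat_id: tuple[int, int]   # (passed, total)


def gates_green(pre: Gate, post: Gate, canonical_dense_md5: str) -> bool:
    """Assumed rule: nothing changed across the run and every op test passed."""
    ops_pass = all(
        passed == total
        for gate in (pre, post)
        for passed, total in (gate.mul_mat, gate.mul_mat_id)
    )
    return pre == post and pre.dense_md5 == canonical_dense_md5 and ops_pass


CANONICAL_DENSE_MD5 = "5951a5b4d624ce891e22ab5fca9bc439"
row = Gate(
    moe_md5="8cb0ce23777bf55f92f63d0292c756b0",
    dense_md5=CANONICAL_DENSE_MD5,
    mul_mat=(1146, 1146),
    mul_mat_id=(806, 806),
)
print(gates_green(pre=row, post=row, canonical_dense_md5=CANONICAL_DENSE_MD5))  # True
```
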
Serving result under graph-node profiling:

| n | agg_tps | decode_agg_tps | decode_perseq_tps | prefill_tps | ttft_mean_ms | wall_s |
|--:|--------:|---------------:|------------------:|------------:|-------------:|-------:|
| `128` | `204.1` | `320.7` | `2.06` | `1490.1` | `8365.1` | `40.146` |

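The `agg_tps` value appears consistent with generated tokens divided by wall time, `n * GEN / wall_s`. That definition is an assumption inferred from the numbers in this record, not something the benchmark harness states here. A quick check:

```python
# Values from the Phase76 shape line and the serving table.
n, gen = 128, 64
wall_s = 40.146

generated_tokens = n * gen                    # 8192 tokens across all sequences
agg_tps_estimate = generated_tokens / wall_s  # assumed definition of agg_tps
print(f"{agg_tps_estimate:.1f}")              # 204.1
```

The same shortcut does not reproduce `decode_perseq_tps`: `320.7 / 128` is about `2.51`, not `2.06`, so that column is presumably measured per request over a different interval.
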
Macro buckets:

| bucket | time ms | share | instances |
|--------|--------:|------:|----------:|
| GDN | `6669.16` | `32.88%` | `25980` |
| MoE/FFN-GEMM | `6264.88` | `30.88%` | `54406` |
| bf16/fp8-proj | `2772.38` | `13.67%` | `53880` |
| layout-copy | `1265.44` | `6.24%` | `81280` |
| ew-mul(weight/norm/GDN) | `734.61` | `3.62%` | `52464` |
| act-quant | `678.95` | `3.35%` | `37526` |
| FA | `264.50` | `1.30%` | `3660` |

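The listed macro buckets do not sum to 100%, so the table evidently omits smaller buckets. The sketch below recovers the implied total profiled kernel time from the GDN row and shows how much of it the table covers. It uses only the table values; reading "share" as a fraction of total bucketed GPU-kernel time is an assumption.

```python
# (time_ms, share_percent) per macro bucket, copied from the table above.
macro = {
    "GDN": (6669.16, 32.88),
    "MoE/FFN-GEMM": (6264.88, 30.88),
    "bf16/fp8-proj": (2772.38, 13.67),
    "layout-copy": (1265.44, 6.24),
    "ew-mul(weight/norm/GDN)": (734.61, 3.62),
    "act-quant": (678.95, 3.35),
    "FA": (264.50, 1.30),
}

listed_ms = sum(ms for ms, _ in macro.values())
listed_share = sum(share for _, share in macro.values())

# Implied total: time / share for the largest bucket (least rounding error).
gdn_ms, gdn_share = macro["GDN"]
implied_total_ms = gdn_ms / (gdn_share / 100.0)

print(f"listed: {listed_ms:.2f} ms ({listed_share:.2f}%)")
print(f"implied total: {implied_total_ms:.0f} ms")
```

On these numbers the table accounts for roughly 92% of an implied total of about 20.28 s of kernel time.
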
Fine buckets:

| bucket | macro | time ms | share | instances |
|--------|-------|--------:|------:|----------:|
| `gdn_core` | GDN | `5876.94` | `28.97%` | `4680` |
| `gdn_conv` | GDN | `454.03` | `2.24%` | `7260` |
| `gdn_gather` | GDN | `237.87` | `1.17%` | `4680` |
| `gdn_l2norm` | GDN | `100.32` | `0.49%` | `9360` |
| `mmq_nvfp4` | MoE/FFN-GEMM | `6055.03` | `29.85%` | `34162` |

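Two things can be read off the fine buckets with simple arithmetic: the four GDN rows add up to the GDN macro bucket, and `gdn_core` is expensive per launch rather than merely frequent. A small sketch using only the table values (the dictionary layout is illustrative):

```python
# bucket: (macro, time_ms, instances), copied from the fine-bucket table.
fine = {
    "gdn_core": ("GDN", 5876.94, 4680),
    "gdn_conv": ("GDN", 454.03, 7260),
    "gdn_gather": ("GDN", 237.87, 4680),
    "gdn_l2norm": ("GDN", 100.32, 9360),
    "mmq_nvfp4": ("MoE/FFN-GEMM", 6055.03, 34162),
}

# Sum of the GDN fine buckets; compare with the 6669.16 ms macro row.
gdn_ms = sum(ms for macro, ms, _ in fine.values() if macro == "GDN")

# Mean time per kernel instance in milliseconds.
mean_ms = {name: ms / count for name, (_, ms, count) in fine.items()}

print(f"GDN fine-bucket sum: {gdn_ms:.2f} ms")
for name, value in mean_ms.items():
    print(f"{name}: {value:.3f} ms/instance")
```

`gdn_core` averages about 1.26 ms per instance against about 0.18 ms for `mmq_nvfp4`, so the two largest fine buckets reach similar totals in very different ways.
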
Decision:

- Phase76 contradicts the Phase75 assumption that GDN decode is not on the
  current critical path. Under graph-node current serving, GDN is the largest
  GPU-kernel macro bucket and `gdn_core` alone is nearly `29%`.
- Do not patch `gated_delta_net.cu` yet. This profile is llama-only and
  graph-node tracing depresses absolute throughput, so it is a source-funding
  signal, not a source patch gate.
- Fund Phase77 as a narrow proof before backend edits:
  compare current `gdn_core` against a vLLM-style direct recurrent/packed decode
  PoC or an in-backend default-off A/B, with pre/post md5 and op gates, and
  require a material reduction in the Phase76 `gdn_core` bucket without
  regressing serving throughput or canonical md5.

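The Phase77 acceptance rule in the last bullet can be sketched as a single predicate. The record does not quantify "material reduction", so `min_reduction=0.10` below is a placeholder, and `Candidate`, `accept`, and `tps_tolerance` are hypothetical names. Comparing against the profiled Phase76 `agg_tps` is also an assumption; a real A/B would compare runs taken under the same profiling mode.

```python
from dataclasses import dataclass

# Phase76 reference values from this record.
PHASE76_GDN_CORE_MS = 5876.94
PHASE76_AGG_TPS = 204.1
CANONICAL_DENSE_MD5 = "5951a5b4d624ce891e22ab5fca9bc439"


@dataclass(frozen=True)
class Candidate:
    """Hypothetical summary of one Phase77 A/B run."""
    gdn_core_ms: float
    agg_tps: float
    dense_md5: str
    gates_green: bool  # pre/post md5 and op gates all passed


def accept(c: Candidate, min_reduction: float = 0.10, tps_tolerance: float = 0.0) -> bool:
    """Accept only if gates hold, gdn_core shrinks enough, and throughput does not regress."""
    reduction = 1.0 - c.gdn_core_ms / PHASE76_GDN_CORE_MS
    return (
        c.gates_green
        and c.dense_md5 == CANONICAL_DENSE_MD5
        and reduction >= min_reduction
        and c.agg_tps >= PHASE76_AGG_TPS * (1.0 - tps_tolerance)
    )


# Illustrative inputs only; these are not measured results.
print(accept(Candidate(5000.0, 205.0, CANONICAL_DENSE_MD5, True)))
print(accept(Candidate(5800.0, 205.0, CANONICAL_DENSE_MD5, True)))
```
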
### Phase75: Post-PoC GDN/VLLM Audit

- Date: 2026-07-01.

@@ -1272,10 +1272,34 @@ Phase75 follow-up audit:
DSL and should not be treated as a portable GB10 patch base until the local
toolchain proves support.

Current next gate: run the Phase72 same-session serving shape on B200/B100/GB200
or equivalent datacenter Blackwell hardware. Require `hardware_class` to be
`datacenter_blackwell`, pre/post md5 and op gates to be green, and graph-node
decode profiles for both llama.cpp and vLLM. If the rerun does not materially
raise the GB10 Phase72 decode ratios (`0.7561`, `0.7158`, `0.6935`), keep GB10
GDN source work stopped and classify the residual gap as engine/kernel
architecture rather than GB10 memory bandwidth.

Phase76 current-stack GB10 graph-node profile:

- Artifact:
  `/home/mudler/bench/phase76_current_moe_profile/20260701_145116`.
- Shape: MoE `q36-35b-a3b-nvfp4`, `n=128`, `PTOK=128`, `GEN=64`,
  `PARALLEL=128`, `CTX=131072`, production defaults.
- Pre/post gates were green: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense
  md5 `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT 1146/1146`,
  `MUL_MAT_ID 806/806`.
- Serving under graph-node profiling: aggregate `204.1 t/s`, decode aggregate
  `320.7 t/s`, prefill `1490.1 t/s`, TTFT mean `8365.1 ms`, wall `40.146 s`.
- Bucket result: GDN was the largest macro bucket, `6669.16 ms` (`32.88%`),
  ahead of MoE/FFN-GEMM `6264.88 ms` (`30.88%`) and BF16 projections
  `2772.38 ms` (`13.67%`). `gdn_core` alone was `5876.94 ms` (`28.97%`).

This supersedes the Phase75 "datacenter only unless fresh profile" wording:
Phase76 is that fresh profile. It does **not** justify an immediate backend
patch because it is llama-only and graph-node tracing depresses absolute
throughput, but it does fund one narrow GB10 follow-up before waiting for B200:
prove whether vLLM's direct recurrent/packed decode idea can reduce the current
`gdn_core` bucket.

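To size what a `gdn_core` improvement could buy, Amdahl's law gives an upper bound on the reduction in profiled GPU-kernel time. The sketch below applies it to the `28.97%` share. The local speedup factors are illustrative, not measurements, and the bound covers kernel time under graph-node tracing only, so serving throughput may move by less.

```python
def kernel_time_speedup(share: float, local_speedup: float) -> float:
    """Amdahl's law: overall speedup when `share` of the time runs `local_speedup` times faster."""
    return 1.0 / ((1.0 - share) + share / local_speedup)


GDN_CORE_SHARE = 0.2897  # Phase76 gdn_core share of profiled kernel time

# Hypothetical local speedups; float("inf") gives the ceiling if gdn_core cost nothing.
for local in (1.5, 2.0, float("inf")):
    overall = kernel_time_speedup(GDN_CORE_SHARE, local)
    print(f"gdn_core x{local}: kernel time x{overall:.3f}")
```

Even a free `gdn_core` would cap the kernel-time speedup at about 1.41x, which is one reason to ask for a measured A/B rather than assume a large serving win.
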
Current next gate:

1. Keep the B200/B100/GB200 Phase72 same-session rerun as the hardware-pivot
   gate when datacenter Blackwell is available.
2. In parallel on GB10, run a Phase77 GDN decode proof with pre/post md5 and op
   gates. Accept only if it materially reduces the Phase76 `gdn_core` bucket and
   does not regress serving throughput or canonical output md5.
3. Do not merge or default-on any `gated_delta_net.cu` change from this evidence
   alone; Phase76 is a profile gate, not a source patch gate.