mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
docs(paged): record graph-node serving profile
Record the Phase 27 current-stack llama.cpp n128 serving profile captured with CUDA graph node tracing and gated before and after the run.

Assisted-by: Codex:gpt-5
@@ -1637,3 +1637,81 @@ Decision:
- Current paged/vLLM decode ratios remain about `81.5%` at n8, `69.0%` at n32,
  and `65.7%` at n128; e2e aggregate ratios remain about `70.6%`, `54.6%`,
  and `49.4%`.

## Phase 27 Graph-Node-Traced Current-Stack Serving Profile

Phase 27 re-profiled the current clean llama.cpp serving path with CUDA graph
node tracing enabled. This checks the Phase 8 bucket picture against the decode
profiling rule: serving/decode profiles must use `--cuda-graph-trace=node`.
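
One way to make the profiling rule hard to forget is to build the capture command through a helper that always injects the node-tracing flag. The sketch below is an illustration only: `build_nsys_command`, the output stem, and the server command line are placeholders and are not taken from the Phase 27 artifact. It assumes the standard `nsys profile [options] <application>` form.

```python
import shlex

# The decode profiling rule from this document: graphs must be traced per node.
REQUIRED_FLAG = "--cuda-graph-trace=node"


def build_nsys_command(output_stem, server_cmd, extra_nsys_args=()):
    """Return an `nsys profile` argv that always traces CUDA graphs per node.

    Any caller-supplied --cuda-graph-trace option is dropped so the rule
    cannot be overridden by accident.
    """
    if not server_cmd:
        raise ValueError("server_cmd must not be empty")
    argv = ["nsys", "profile", REQUIRED_FLAG, "-o", output_stem]
    argv += [a for a in extra_nsys_args if not a.startswith("--cuda-graph-trace")]
    return argv + list(server_cmd)


if __name__ == "__main__":
    # Placeholder server invocation; the real command is not recorded here.
    cmd = build_nsys_command("phase27_n128", ["./llama-server", "-m", "model.gguf"])
    print(shlex.join(cmd))
```

The helper only assembles the argument list, so it can be unit-tested without a GPU or an Nsight installation.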

Artifact:

- `/home/mudler/bench/phase27_graph_node_serving/20260701_055519`

Source and hardware:

- `/home/mudler/llama-phase6-source`
- `f2521ab12 feat(server): trace speculative batch shapes`
- `GPU 0: NVIDIA GB10`, driver `580.159.03`, compute capability `12.1`
- Nsight Systems `2025.3.2.474-253236389321v0`

Safety gates:

| phase | check | status | actual |
|-------|-------|--------|--------|
| pre | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
| pre | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
| pre | `MUL_MAT_ID` | ok | `806/806` |
| post retry | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
| post retry | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
| post retry | `MUL_MAT_ID` | ok | `806/806` |

The first immediate post-gate attempt raced with Nsight teardown and rejected
the run because it detected one compute process even though `nvidia-smi` already
printed no running processes. The post-gate retry started from `docker=0`,
`local_ai_worker=0`, `compute=0`, and a `FREE` owner file.
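
The gate logic described above can be sketched roughly as follows. This is not the actual gate script: `HostState`, `run_gate`, and the row layout are hypothetical names chosen to mirror the table, and `806/806` is interpreted here as passed/total, which is an assumption. The point is the ordering: the idle check runs first, so a lingering compute process rejects the attempt before any md5 is compared.

```python
from dataclasses import dataclass

# Expected values copied from the safety-gate table in this document.
EXPECTED_MD5 = {
    "MoE md5": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense md5": "5951a5b4d624ce891e22ab5fca9bc439",
}


@dataclass
class HostState:
    """Hypothetical snapshot of the counters named in the retry description."""
    docker: int
    local_ai_worker: int
    compute: int
    owner: str


def host_is_idle(state):
    """A gate may only start from docker=0, local_ai_worker=0, compute=0, FREE."""
    return (state.docker == 0 and state.local_ai_worker == 0
            and state.compute == 0 and state.owner == "FREE")


def run_gate(phase, state, md5s, mul_mat_id):
    """Return (phase, check, status, actual) rows, or raise if the host is busy."""
    if not host_is_idle(state):
        raise RuntimeError(f"{phase}: host not idle ({state}); retry after teardown")
    rows = []
    for check, expected in EXPECTED_MD5.items():
        actual = md5s.get(check, "")
        rows.append((phase, check, "ok" if actual == expected else "FAIL", actual))
    # Assumption: the op gate reports "passed/total".
    passed, total = (int(x) for x in mul_mat_id.split("/"))
    status = "ok" if total > 0 and passed == total else "FAIL"
    rows.append((phase, "MUL_MAT_ID", status, mul_mat_id))
    return rows
```

With this shape, the first post-gate attempt corresponds to `run_gate` raising because `compute=1`, and the retry corresponds to a second call once the counters are back to zero.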

Serving sample (`n=128`, `PTOK=128`, `GEN=64`):

| agg_tps | decode_agg_tps | decode_perseq_tps | prefill_tps | TTFT mean ms |
|---------|----------------|--------------------|-------------|--------------|
| 319.9 | 675.5 | 3.9 | 1671.1 | 8363.4 |

This matches Phase 26's n128 paged decode rate (`673.4` decode_agg_tps) closely
enough to treat the profile as representative for bucket direction.
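
The "closely enough" judgment can be made explicit with a relative-delta check. The sketch below uses the two decode rates quoted above; the `1%` tolerance is an assumed threshold for illustration, not a value defined by the benchmark harness.

```python
def relative_delta(new, reference):
    """Signed relative difference of `new` against `reference`."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return (new - reference) / reference


def is_representative(new, reference, tolerance=0.01):
    """True when the profiled run is within `tolerance` of the baseline.

    The default 1% tolerance is an assumption made for this sketch.
    """
    return abs(relative_delta(new, reference)) <= tolerance


if __name__ == "__main__":
    phase27_decode = 675.5  # decode_agg_tps from the profiled run
    phase26_decode = 673.4  # Phase 26 n128 paged decode_agg_tps
    delta = relative_delta(phase27_decode, phase26_decode)
    print(f"delta = {delta:+.2%}, representative = "
          f"{is_representative(phase27_decode, phase26_decode)}")
```

The two rates differ by roughly 0.3%, so the profiling overhead did not visibly distort decode throughput in this run.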

Graph-node-traced kernel buckets:

| macro bucket | time ms | share |
|--------------|---------|-------|
| GDN | 6706.33 | 33.47% |
| MoE/FFN-GEMM | 5871.92 | 29.31% |
| bf16-proj | 2725.07 | 13.60% |
| layout-copy | 1309.99 | 6.54% |
| ew-mul(weight/norm/GDN) | 724.29 | 3.61% |
| act-quant | 697.75 | 3.48% |
| norms/residual | 405.29 | 2.02% |
| ew-add(resid/MoE-fanin) | 361.81 | 1.81% |
| MoE-dispatch | 275.99 | 1.38% |
| FA | 271.03 | 1.35% |

Fine buckets:

- `gdn_core`: `5929.85 ms` (`29.59%`)
- `mmq_nvfp4`: `5697.79 ms` (`28.44%`)
- `cublas_bf16_gemm`: `1892.81 ms` (`9.45%`)
- `act_quant`: `697.75 ms` (`3.48%`)
- `mm_ids`: `121.99 ms` (`0.61%`)
- `gather_mmq`: `73.88 ms` (`0.37%`)
- `argsort_topk`: `80.11 ms` (`0.40%`)

Decision:

- The graph-node-traced current-stack profile confirms the Phase 8 source
  shortcut decision. Metadata/helper work is still too small: `mm_ids`,
  `gather_mmq`, and `argsort_topk` together are about `1.38%`.
- A credible GB10 source patch would have to reduce `gdn_core` or
  `mmq_nvfp4`/bf16 projection work directly. The low-conflict helper-dispatch
  path still should not be reopened.
- The serving profile does not change the Phase 26 parity verdict: n128 paged
  decode remains about `675 tok/s`, far below vLLM's same-session `1025 tok/s`.
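
The decision above is essentially an Amdahl's-law argument, which the following sketch spells out. It treats each bucket's share of traced kernel time as its share of decode time; that is a simplifying assumption, since the trace also covers prefill and ignores launch gaps. The function name is illustrative.

```python
def amdahl_speedup(share, local_speedup=float("inf")):
    """Overall speedup when a fraction `share` of the time gets `local_speedup`x faster.

    With the default infinite local speedup, this is the upper bound obtained
    by removing that work entirely.
    """
    if not 0.0 <= share < 1.0:
        raise ValueError("share must be in [0, 1)")
    return 1.0 / ((1.0 - share) + share / local_speedup)


if __name__ == "__main__":
    # Shares from the fine-bucket list above, as fractions of traced time.
    helper_share = 0.0061 + 0.0037 + 0.0040  # mm_ids + gather_mmq + argsort_topk
    decode_tps = 675.5
    vllm_tps = 1025.0

    bound = amdahl_speedup(helper_share)
    print(f"helper work removed entirely: x{bound:.4f} "
          f"-> about {decode_tps * bound:.0f} tok/s")
    print(f"speedup needed for vLLM parity: x{vllm_tps / decode_tps:.3f}")
    print(f"gdn_core removed entirely:  x{amdahl_speedup(0.2959):.3f}")
    print(f"mmq_nvfp4 removed entirely: x{amdahl_speedup(0.2844):.3f}")
```

Under these assumptions, deleting all helper-dispatch work would buy about 1.4%, or roughly 685 tok/s, which is why only the `gdn_core` and `mmq_nvfp4` buckets are large enough to move the parity ratio.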
@@ -364,6 +364,18 @@ Use Phase 26 as the current audit-grade GB10 snapshot. It keeps the Phase 20
verdict intact, but the artifact is more useful for future regressions because
it carries hardware classification and compact pre/post inference gates.

Phase 27 re-profiled the current clean llama.cpp n128 serving path with
`nsys --cuda-graph-trace=node`. Artifact:
`/home/mudler/bench/phase27_graph_node_serving/20260701_055519`. The run matched
Phase 26 throughput closely (`675.5` vs `673.4` decode_agg_tps) and kept gates
green before and after the profile (post retry): MoE md5
`8cb0ce23777bf55f92f63d0292c756b0`, dense md5
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID` `806/806`. The node-traced
buckets still put the work in `gdn_core` (`29.59%`) and `mmq_nvfp4` (`28.44%`);
helper dispatch remains too small (`mm_ids` `0.61%`, `gather_mmq` `0.37%`,
`argsort_topk` `0.40%`). Do not reopen metadata/helper-only MoE dispatch work on
GB10.

---

## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -430,6 +442,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
- `~/bench/phase24_hardware_report_dryrun/20260701_052741` - current snapshot harness dry run proving `hardware.txt` captures the DGX as `hardware_class=gb10_or_workstation_blackwell`.
- `~/bench/phase25_gate_summary_dryrun/20260701_053353` - dry run after adding `gate_summary.tsv` support; normal dry-run still writes `hardware.txt` and does not emit a gate summary before gates exist.
- `~/bench/phase26_audited_snapshot/20260701_053650` - current audit-grade full paged-vs-vLLM MoE serving snapshot with `hardware.txt`, pre/post gates, `summary.tsv`, and `gate_summary.tsv`.
- `~/bench/phase27_graph_node_serving/20260701_055519` - current clean llama.cpp n128 serving profile captured with `--cuda-graph-trace=node`, pre/post retry gates green.
- Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
- Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
- A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
@@ -716,6 +716,36 @@ parity on GB10. Treat Phase 26 as the current benchmark baseline before funding
new kernel work, and keep md5/op gates as the first check when changing the
patch stack.

### Phase 27 graph-node-traced current-stack profile

Phase 27 re-profiled the current clean llama.cpp n128 serving path with
`--cuda-graph-trace=node`, using the same source (`f2521ab12`) and GB10 host.
Artifact: `/home/mudler/bench/phase27_graph_node_serving/20260701_055519`.

The profile run itself reported `decode_agg_tps=675.5`, close to Phase 26's
n128 paged `673.4`, so the trace is representative for bucket direction. Pre
gates passed, and the post-gate retry passed after Nsight teardown finished:
MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.

Graph-node-traced macro buckets:

| bucket | time ms | share |
|--------|---------|-------|
| GDN | 6706.33 | 33.47% |
| MoE/FFN-GEMM | 5871.92 | 29.31% |
| bf16-proj | 2725.07 | 13.60% |
| layout-copy | 1309.99 | 6.54% |
| act-quant | 697.75 | 3.48% |
| MoE-dispatch | 275.99 | 1.38% |
| FA | 271.03 | 1.35% |

Fine rows keep the same decision shape as Phase 8: `gdn_core` is `29.59%`,
`mmq_nvfp4` is `28.44%`, while `mm_ids` is `0.61%`, `gather_mmq` is `0.37%`,
and `argsort_topk` is `0.40%`. Do not reopen metadata/helper-only MoE dispatch
work on GB10. Any credible source patch must directly reduce GDN, grouped-MMQ,
or projection work and still pass the md5/op gates.

Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

### Phase 10 GDN C32 slab update