mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
docs(paged): record dense true decode profile
Assisted-by: Codex:gpt-5
@@ -2769,3 +2769,67 @@ Verification:
and a DGX dense dry-run with `VLLM_READY_ATTEMPTS=700`.
- DGX dry-run artifact:
  `/home/mudler/bench/phase49_vllm_env_hygiene_dryrun/20260701_102138`.

## Phase 50 Dense True Decode Profile

Phase 50 separates dense high-concurrency decode from the Phase 47 h2h serving
window. The Phase 47 h2h `decode_agg_tps` metric can count tokens generated by
early requests while later requests are still in prefill, then divide by a
window that starts at the latest first-token timestamp. That is useful serving
telemetry, but it is not a pure steady-decode measurement.

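The accounting issue can be illustrated with a toy model. This is a sketch of the metric as described above, not the harness code: the `Request` fields, both function names, and the uniform per-request token rate are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    first_token_s: float  # time the first generated token arrived
    done_s: float         # time the last generated token arrived
    n_tokens: int         # tokens generated by this request


def windowed_decode_tps(reqs):
    """All generated tokens divided by a window that starts at the latest
    first-token timestamp (the accounting described in the text)."""
    start = max(r.first_token_s for r in reqs)
    end = max(r.done_s for r in reqs)
    return sum(r.n_tokens for r in reqs) / (end - start)


def in_window_decode_tps(reqs):
    """Only tokens produced inside that same window, assuming each request
    emits tokens at a uniform rate (an illustrative assumption)."""
    start = max(r.first_token_s for r in reqs)
    end = max(r.done_s for r in reqs)
    tokens = 0.0
    for r in reqs:
        rate = r.n_tokens / (r.done_s - r.first_token_s)
        overlap = max(0.0, min(r.done_s, end) - max(r.first_token_s, start))
        tokens += rate * overlap
    return tokens / (end - start)


# Two requests; the second is still in prefill while the first decodes.
reqs = [Request(1.0, 5.0, 64), Request(3.0, 7.0, 64)]
print(windowed_decode_tps(reqs))   # 32.0
print(in_window_decode_tps(reqs))  # 24.0
```

In this toy case the windowed metric credits the window with 32 tokens that the first request produced before the window opened. How much that shifts a real engine-to-engine ratio depends on each engine's prefill overlap, which this sketch does not model.
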
Artifact:

- `/home/mudler/bench/phase50_dense_true_decode/20260701_103120`

Preflight:

- Docker containers: `0`
- `local-ai-worker`: `0`
- GPU compute apps: `0`
- GPU: `NVIDIA GB10`, driver `580.159.03`

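A preflight of this kind could be scripted roughly as follows. The source records only the resulting counts, so the three commands, the `CHECKS` table, and the function names are assumptions about how such counts might be gathered, not the harness's actual preflight.

```python
import subprocess

# Hypothetical commands for each count; the source records only the results.
CHECKS = {
    "docker_containers": ["docker", "ps", "-q"],
    "local_ai_worker": ["pgrep", "-f", "local-ai-worker"],
    "gpu_compute_apps": [
        "nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader",
    ],
}


def run_command(argv):
    """Run argv and return its stdout. The exit status is not checked
    because pgrep exits 1 when nothing matches; a real harness should
    still distinguish "no matches" from "tool failed"."""
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.stdout


def preflight(run=run_command):
    """Count non-empty output lines per check (one line per container,
    process, or GPU compute app)."""
    return {
        name: sum(1 for line in run(argv).splitlines() if line.strip())
        for name, argv in CHECKS.items()
    }


def is_clean(counts):
    """True when every count is zero, matching the preflight above."""
    return all(count == 0 for count in counts.values())
```

Passing the runner as a parameter keeps the counting logic testable on a machine without Docker or a GPU.
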
Inference gates:

| phase | build | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
|-------|-------|---------|-----------|-----------|--------------|
| pre | `build-phase36` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| pre | `build-cuda` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| post | `build-cuda` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |

`build-phase36/bin` had the completion and op-test binaries but not
`llama-batched-bench`, so the actual profiled llama.cpp decode binary came from
`~/llama-phase6-source/build-cuda/bin`. That build was gated before and after
the profile.

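The gate table can be checked mechanically. The sketch below assumes "green" means that every row reports the same MoE and dense md5 pair and that every op test passed; that definition, the `GateRow` type, and the function names are illustrative, while the values are copied from the table.

```python
from dataclasses import dataclass

MOE_MD5 = "8cb0ce23777bf55f92f63d0292c756b0"
DENSE_MD5 = "5951a5b4d624ce891e22ab5fca9bc439"


@dataclass(frozen=True)
class GateRow:
    phase: str
    build: str
    moe_md5: str
    dense_md5: str
    mul_mat: str      # "passed/total"
    mul_mat_id: str   # "passed/total"


ROWS = [
    GateRow("pre", "build-phase36", MOE_MD5, DENSE_MD5, "1146/1146", "806/806"),
    GateRow("pre", "build-cuda", MOE_MD5, DENSE_MD5, "1146/1146", "806/806"),
    GateRow("post", "build-cuda", MOE_MD5, DENSE_MD5, "1146/1146", "806/806"),
]


def ops_all_passed(cell):
    """True when a "passed/total" cell reports every test passing."""
    passed, total = (int(part) for part in cell.split("/"))
    return total > 0 and passed == total


def gates_green(rows):
    """Assumed definition: identical md5 pair on every row, all ops passing."""
    md5_pairs = {(row.moe_md5, row.dense_md5) for row in rows}
    ops_ok = all(
        ops_all_passed(row.mul_mat) and ops_all_passed(row.mul_mat_id)
        for row in rows
    )
    return len(md5_pairs) == 1 and ops_ok


print(gates_green(ROWS))  # True
```
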
Profile method:

- Shape: dense Qwen3.5, `npl=128`, `npp=128`, `ntg=16` and `ntg=64`.
- Paged command: `llama-batched-bench` with `LLAMA_KV_PAGED=1`,
  `LLAMA_MOE_FORCE_GRAPHS=1`, `-c 131072 -b 2048 -ub 512 -ngl 99 -fa on`.
- vLLM command: in-process `LLM.generate`, `max_model_len=4096`,
  `max_num_seqs=256`, `gpu_memory_utilization=0.85`, prefix caching disabled.
- Both profiles used `nsys --cuda-graph-trace=node`.
- Difference method: `(ntg64 tokens - ntg16 tokens) / (ntg64 wall - ntg16 wall)`.

Results:

| engine | ntg16 wall s | ntg64 wall s | delta tokens | delta wall s | true decode t/s |
|--------|--------------|--------------|--------------|--------------|-----------------|
| paged | `5.754` | `21.768` | `6144` | `16.014` | `383.66` |
| vLLM | `13.041` | `27.165` | `6144` | `14.124` | `435.00` |
| ratio | | | | | `0.8820` |

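The difference method can be reproduced from the wall times in the table. This sketch assumes delta tokens equal `npl * (ntg_long - ntg_short)`, which matches the `6144` reported for `npl=128`; the function name is illustrative.

```python
def true_decode_tps(npl, ntg_short, ntg_long, wall_short_s, wall_long_s):
    """Difference method: subtracting the short run from the long run
    cancels the cost both runs share (prefill, startup), leaving decode."""
    delta_tokens = npl * (ntg_long - ntg_short)
    delta_wall_s = wall_long_s - wall_short_s
    return delta_tokens, delta_wall_s, delta_tokens / delta_wall_s


paged = true_decode_tps(128, 16, 64, 5.754, 21.768)
vllm = true_decode_tps(128, 16, 64, 13.041, 27.165)
ratio = paged[2] / vllm[2]

print(f"paged {paged[2]:.2f} t/s")  # paged 383.66 t/s
print(f"vLLM  {vllm[2]:.2f} t/s")   # vLLM  435.00 t/s
print(f"ratio {ratio:.4f}")         # ratio 0.8820
```

The method only isolates decode if the shared cost really is the same in both runs, which is why both runs use identical `npl` and `npp`.
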
Interpretation:

- Dense true decode at `n=128` is about `88.2%` of vLLM, not the `79.1%`
  implied by Phase 47 h2h aggregate decode. The Phase 47 serving window therefore
  includes real scheduler/accounting effects in addition to GPU decode speed.
- There is still a real dense GPU-steady decode gap of about `12%`, but it is
  not large enough to explain Phase 47 aggregate serving (`50.7%` of vLLM) or
  TTFT (`3.20x` vLLM) by itself.
- The next low-conflict code phase should add an opt-in serving
  batch-composition/admission trace around `server_context::pre_decode()` to
  measure decode tokens admitted, prompt tokens admitted, waiting prompt slots,
  graph reuse, and prefill starvation. Do not start with another GDN or GEMM
  rewrite unless that trace rules the scheduler out.

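To make the proposed trace concrete, the sketch below summarizes per-step records offline. No such trace exists yet, so the `StepTrace` schema, the field names, and the definition of a prefill-starved step (prompt slots waiting while zero prompt tokens are admitted) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StepTrace:
    # Hypothetical per-step record; one entry per decode batch.
    decode_tokens: int         # decode tokens admitted into the batch
    prompt_tokens: int         # prompt (prefill) tokens admitted
    waiting_prompt_slots: int  # slots with prompt work still queued
    graph_reused: bool         # whether the compute graph was reused


def summarize(steps):
    """Aggregate the quantities the proposed trace is meant to expose."""
    n = len(steps)
    decode = sum(s.decode_tokens for s in steps)
    prompt = sum(s.prompt_tokens for s in steps)
    admitted = decode + prompt
    # Assumed definition of starvation: prompts waiting, none admitted.
    starved = sum(
        1 for s in steps
        if s.waiting_prompt_slots > 0 and s.prompt_tokens == 0
    )
    return {
        "steps": n,
        "decode_token_share": decode / admitted if admitted else 0.0,
        "graph_reuse_rate": sum(s.graph_reused for s in steps) / n if n else 0.0,
        "prefill_starved_steps": starved,
    }


# Made-up records, for illustration only.
steps = [
    StepTrace(128, 0, 4, True),
    StepTrace(96, 416, 0, False),
    StepTrace(128, 0, 0, True),
    StepTrace(64, 448, 2, False),
]
print(summarize(steps))
```

A summary like this would show directly whether high-N batches are dominated by prompt tokens or by decode tokens, which is the question the scheduler hypothesis turns on.
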
@@ -613,6 +613,19 @@ Phase 49 removes vLLM log noise from harness-owned environment variables. The
preserving intentional vLLM runtime variables such as `VLLM_LOGGING_LEVEL`. Dry
run: `/home/mudler/bench/phase49_vllm_env_hygiene_dryrun/20260701_102138`.

Phase 50 resolves the dense high-N decode-accounting question with a graph-node
difference-method profile. Artifact:
`/home/mudler/bench/phase50_dense_true_decode/20260701_103120`. Pre/post
inference gates on the profiled `build-cuda` binary stayed green: MoE
`8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
`806/806`. Dense `npl=128`, `npp=128` true decode is `383.66 t/s` for paged and
`435.00 t/s` for vLLM, ratio `0.8820`. This means Phase 47's `0.7912` h2h
decode ratio and `0.5071` aggregate ratio include scheduler/admission and
prefill-overlap/accounting effects beyond the real GPU-steady decode gap. Next
GB10 code work should instrument batch composition/admission in
`server_context::pre_decode()` before attempting another kernel shortcut.

---

## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -705,6 +718,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
- `~/bench/phase48_readiness_harness_dryrun/20260701_100533` - harness dry-run proving configurable readiness budgets and clean preflight before retrying dense serving.
- `~/bench/phase47_dense_serving_retry/20260701_100811` - completed dense serving snapshot after Phase48; pre/post md5 and op gates green; paged low-N decode ahead, high-N aggregate and TTFT behind.
- `~/bench/phase49_vllm_env_hygiene_dryrun/20260701_102138` - harness dry-run after scrubbing harness-owned `VLLM_*` variables from the `vllm serve` child environment.
- `~/bench/phase50_dense_true_decode/20260701_103120` - dense graph-node difference-method profile at `npl=128`, `npp=128`; `build-cuda` pre/post md5 and op gates green; true decode paged `383.66 t/s`, vLLM `435.00 t/s`, ratio `0.8820`, pointing next at serving admission/scheduler tracing.
- Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
- Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
- A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.

@@ -1186,6 +1186,36 @@ Decision: dense low-N decode remains a real paged strength, but dense serving
still does not close GB10 parity because TTFT and high-concurrency aggregate
throughput remain substantially behind vLLM.

### Phase 50 dense true decode profile

Phase 50 profiles dense `npl=128`, `npp=128` decode with graph nodes expanded and
uses the difference method (`ntg=64 - ntg=16`) instead of the Phase 47 h2h
serving window. Artifact:
`/home/mudler/bench/phase50_dense_true_decode/20260701_103120`.

Pre/post inference gates stayed green on the profiled `build-cuda` binary set:
MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and
`MUL_MAT_ID` `806/806`. A `build-phase36` pre-gate also passed, but
`build-phase36` did not contain `llama-batched-bench`, so `build-cuda` is the
profiled/gated build for this phase.

Results:

| engine | ntg16 wall s | ntg64 wall s | delta tokens | delta wall s | true decode t/s |
|--------|--------------|--------------|--------------|--------------|-----------------|
| paged | `5.754` | `21.768` | `6144` | `16.014` | `383.66` |
| vLLM | `13.041` | `27.165` | `6144` | `14.124` | `435.00` |
| ratio | | | | | `0.8820` |

Decision: Phase 47's dense high-N serving loss is not just a kernel-speed gap.
True dense decode is still behind vLLM by about `12%`, but the Phase 47 h2h
decode ratio at `n=128` was `0.7912` and aggregate serving was only `0.5071`.
The remaining difference points at scheduler/admission, prefill overlap, and
TTFT accounting. Next implementation target should be an opt-in
batch-composition/admission trace in `server_context::pre_decode()` before any
new GDN/GEMM shortcut.

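One rough way to size the non-kernel part of the gap is to divide each serving ratio by the true decode ratio. This assumes the gap factors combine multiplicatively, which is an interpretive assumption and not something this phase measured; the names below are illustrative and the inputs are the ratios quoted above.

```python
TRUE_DECODE_RATIO = 0.8820   # Phase 50 difference-method ratio
H2H_DECODE_RATIO = 0.7912    # Phase 47 h2h decode ratio at n=128
AGG_SERVING_RATIO = 0.5071   # Phase 47 aggregate serving ratio


def residual_factor(observed_ratio, steady_ratio):
    """Share of the observed ratio left unexplained by steady decode,
    under the assumption that gap factors multiply."""
    return observed_ratio / steady_ratio


h2h_residual = residual_factor(H2H_DECODE_RATIO, TRUE_DECODE_RATIO)
agg_residual = residual_factor(AGG_SERVING_RATIO, TRUE_DECODE_RATIO)

print(f"h2h decode residual:        {h2h_residual:.3f}")  # 0.897
print(f"aggregate serving residual: {agg_residual:.3f}")  # 0.575
```

Read this way, even kernel parity with vLLM would leave aggregate serving near `0.575` of vLLM, which is consistent with looking at the scheduler before another kernel rewrite.
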
Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

### Phase 10 GDN C32 slab update
