docs(paged): A.2 final synthesis - CUDA-graph decode verdict

Append the four-point synthesis to A2_CUDAGRAPH_DECODE.md: measured
CUDA-graph lever size (<1%, not the guessed 10-20%), the corrected
'eager' premise (default paged decode already captures), the unchanged
37-38% of vLLM at npl128, and the honest verdict that A.2 closes none of
the 2.6x gap because paged attention touches ~0.4% of decode on this
hybrid-SSM model. Residual lever is the qwen35 gated-DeltaNet SSM path
(state D2D copy + get_rows gather), orthogonal to paged attention.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ettore Di Giacinto
2026-06-24 21:45:42 +00:00
parent 2dd5d68e6d
commit 34cadb64af

@@ -295,3 +295,53 @@ attention. `patches/paged/0018` stays free. Evidence on the DGX:
`~/bench/a2_decompose/decode_decomp.txt` (per-kernel table + reproducing SQL in
its header), `~/bench/a2_decompose/SUMMARY.txt`, and the Phase-1 reps
`~/bench/a2_nsys/paged_off_npl128.sqlite` / `paged_on_npl128_node.sqlite`.
# A.2 final synthesis - the four-point verdict
All numbers measured on the DGX (GB10, sm_121, q36-27b-nvfp4 dense, fusion OFF,
`decode_agg` = `S_TG t/s`), npl 128 unless noted.
**1. CUDA-graph lever size (measured, not guessed).** +0.13% (4-cell, stock
ON-vs-OFF) to +0.78% (fresh paged re-check) at npl 128; +1.1% to +1.4% at npl 32.
All inside run-to-run noise. The earlier grounding GUESSED ~10-20% from a
94.6%-busy reading; direct measurement puts the steady decode at 99.4-99.5% busy,
so the real graph ceiling is < 1%, not 10-20%. The guess was wrong because the
busy-fraction it rested on was under-read (a graphs-ON nsys trace under-counts
GPU-busy unless `--cuda-graph-trace=node` is set - trap #2).
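
The arithmetic behind point 1 can be sketched as follows. This is a minimal illustration, not the script used for the measurements: `graph_ceiling` and `uplift` are hypothetical helper names, and the ceiling model (all GPU idle time is recoverable, nothing else changes) is an assumption made here for illustration.

```python
def graph_ceiling(busy_fraction: float) -> float:
    """Best-case throughput gain if every idle GPU moment were removed.

    Assumption: step time scales with 1 / busy_fraction, so recovering all
    idle time speeds decode up by at most 1 / busy_fraction - 1.
    """
    return 1.0 / busy_fraction - 1.0


def uplift(on_tps: float, off_tps: float) -> float:
    """Relative decode_agg change of graphs ON versus graphs OFF."""
    return on_tps / off_tps - 1.0


# Measured steady-decode busy fractions at npl 128.
for busy in (0.994, 0.995):
    print(f"busy {busy:.1%}: graph ceiling {graph_ceiling(busy):.2%}")

# Measured ON / OFF decode_agg pairs (t/s) at npl 128.
print(f"4-cell:               {uplift(148.41, 148.21):+.2%}")
print(f"fresh paged re-check: {uplift(146.03, 144.90):+.2%}")
```

Under that assumption both busy readings give a ceiling below 1%, and the two ON / OFF pairs reproduce the +0.13% and +0.78% figures quoted above.
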
**2. Was "paged decode runs eager" fixed, and what is the decode_agg win?**
There was nothing to fix: the premise was false. At the benchmarked context the
DEFAULT in-kernel paged decode already captures and replays graphs, with a
256-token reset cadence identical to stock non-paged (10 complete / 8 reset over
~320 steps, resets clustered only at the 256/512 token boundaries). "graphs
reused=0" was a uid fast-path false negative, not eager execution (trap #1). The
only genuinely-eager path is the `LLAMA_KV_PAGED_GATHER=1` fallback (unpadded
index grows every step), which is not the default decode. Because graphs were
already engaged, the decode_agg win from "enabling" them is ~0 (+0.1% to +0.8%).
Graphs DID collapse within-step launch idle (0.37% -> 0.11%, ~80k -> ~15k
launches/run), but the GPU stays 99.4-99.5% busy, so throughput is unchanged.
**3. New llama %-of-vLLM @npl128.** Unchanged by A.2: 146-148.6 t/s vs vLLM 391
t/s = **37.3-38.0%**. Graphs ON and OFF land within 1% of each other (146.03 /
144.90 t/s in the fresh re-check; 148.41 / 148.21 t/s in the 4-cell), so A.2 did
not move the percentage.
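
For reference, the percentage in point 3 is a plain ratio against the vLLM figure. The helper name `pct_of_vllm` is hypothetical; the inputs are the t/s values quoted above.

```python
VLLM_TPS = 391.0  # vLLM decode throughput at npl 128, from this document


def pct_of_vllm(llama_tps: float, vllm_tps: float = VLLM_TPS) -> float:
    """llama decode_agg expressed as a percentage of the vLLM throughput."""
    return 100.0 * llama_tps / vllm_tps


# Low and high end of the quoted llama range.
for tps in (146.0, 148.6):
    gap = VLLM_TPS / tps  # how many times faster vLLM is
    print(f"{tps:6.1f} t/s -> {pct_of_vllm(tps):.1f}% of vLLM ({gap:.1f}x gap)")
```
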
**4. Honest verdict - did A.2 move toward parity; residual + next lever.** No.
A.2 closed zero of the 2.6x gap, and it provably cannot on this model: paged /
full attention is ~0.4% of decode (16 full-attention layers vs 48 gated-DeltaNet
layers, a 3:1 hybrid SSM), so no graph / block-table / gather change to the paged
path can move decode_agg. The residual gap is structural and lives elsewhere:
~67% of decode is gated-DeltaNet / SSM state plumbing, dominated by 23.4%
recurrence, 21.9% get_rows state gather, and 18.9% D2D recurrent-state copy
(~230 MB per SSM layer per step, ~18 GB/step). Another ~28% is FP4 matmul, which
Track B already showed to be secondary: a 26%-faster GEMM left decode_agg flat.
The within-step launch loop is solved (graphs), and the between-step host loop is
a 0.24% second-order floor hidden under GPU compute. Neither is the residual.
The next lever is NOT in this track. It is the ggml qwen35 gated-DeltaNet decode:
(1) eliminate the per-layer recurrent-state D2D copy (in-place / double-buffer
write-back), and (2) fuse the get_rows gather into the recurrent kernel - mirroring
vLLM's `fused_recurrent_gated_delta_rule`, which keeps the state in place and
fuses the gather. Measured ceiling on this rep: removing the copy takes ~146 t/s
to ~180 t/s; removing copy + gather takes ~146 t/s to ~247 t/s (within ~1.6x of
vLLM with FP4 GEMM still untouched). That work is orthogonal to paged attention;
`patches/paged/0018` stays free.
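
The ceiling figures above are consistent with a simple time-fraction projection, sketched below. This is an illustration under stated assumptions, not necessarily how the numbers were produced: it assumes the removed kernels are not overlapped with other work and that the rest of the step is unchanged. `projected_tps` is a hypothetical helper name.

```python
def projected_tps(base_tps: float, removed_fractions: list[float]) -> float:
    """Throughput after deleting the given fractions of decode time.

    Assumption: the removed time is fully serial, so the new step time is
    (1 - sum(removed_fractions)) of the old one.
    """
    remaining = 1.0 - sum(removed_fractions)
    if remaining <= 0.0:
        raise ValueError("cannot remove 100% or more of the step time")
    return base_tps / remaining


BASE_TPS = 146.0   # measured decode_agg at npl 128
VLLM_TPS = 391.0   # vLLM reference at npl 128
D2D_COPY = 0.189   # share of decode in the recurrent-state D2D copy
GATHER = 0.219     # share of decode in the get_rows state gather

no_copy = projected_tps(BASE_TPS, [D2D_COPY])
no_copy_no_gather = projected_tps(BASE_TPS, [D2D_COPY, GATHER])

print(f"remove copy:          {no_copy:.0f} t/s")
print(f"remove copy + gather: {no_copy_no_gather:.0f} t/s")
print(f"remaining gap to vLLM: {VLLM_TPS / no_copy_no_gather:.1f}x")
```
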