From 2dd5d68e6de4e1613dc95c4e0f0c5e5828e8c961 Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Wed, 24 Jun 2026 21:44:22 +0000 Subject: [PATCH] docs(paged): A.2 Phase 2 - locate the real decode lever (gated-DeltaNet SSM path) Phase 1 ruled out CUDA graphs as the paged-decode lever (GPU 99.4% busy, decode_agg flat graphs on-vs-off) and attributed the 2.6x gap to vLLM to the per-step GPU kernel work (FP4 GEMM + attention at batch 128). Phase 2 decomposed that kernel work directly on the Phase-1 nsys reps and corrects the attribution. Findings (q36-27b-nvfp4 = gguf arch qwen35, a 48:16 hybrid gated-DeltaNet linear-attention + full-attention model; DGX GB10 sm_121, fusion off): - Graphs re-confirmed not the lever: fresh paged graphs-ON 146.03 vs OFF 144.90 t/s (+0.78%, noise); the captured rep is 99.5% busy with the same ~3267ms memcpy (graphs capture memcpy nodes too). - The 99.4% busy is real but ~19% of it is D2D memcpy, not compute: an overlap-correct interval-union sweep gives kernels-only 80.2% busy, the gap filled by 1584 D2D copies/run (~80/step, ~230MB each = the gated-DeltaNet recurrent state). Phase 1's cuda_gpu_trace lumped this into compute. - Decode GPU-time decomposition (% of kernel+memcpy busy): gated_delta_net 23.4%, get_rows 21.9%, D2D state copy 18.9%, FP4 GEMV 15.5%, FP4 GEMM 10.4%, full attention 0.4%. Grouped: SSM/gated-DeltaNet machinery ~67%, FP4 matmul ~28%, full attention (all paged-attn optimizes) ~0.4%. Verdict: not graphs, not the host loop, not primarily FP4 GEMM, not attention. Paged attention touches ~0.4% of decode on this model, so no paged/graph/ block-table change can move decode_agg. The lever is the ggml qwen35 gated-DeltaNet decode: kill the per-layer recurrent-state D2D copy and fuse the get_rows gather into the recurrence (vLLM's fused_recurrent_gated_delta_rule keeps state in place). Ceiling: -copy ~146->180; -copy-and-gather ~146->247 t/s. No code patch (the lever is an SSM-path rewrite, orthogonal to paged attention); patches/paged/0018 stays free. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto --- .../patches/paged/A2_CUDAGRAPH_DECODE.md | 120 ++++++++++++++++++ 1 file changed, 120 insertions(+) diff --git a/backend/cpp/llama-cpp/patches/paged/A2_CUDAGRAPH_DECODE.md b/backend/cpp/llama-cpp/patches/paged/A2_CUDAGRAPH_DECODE.md index 7f8312773..2965efd20 100644 --- a/backend/cpp/llama-cpp/patches/paged/A2_CUDAGRAPH_DECODE.md +++ b/backend/cpp/llama-cpp/patches/paged/A2_CUDAGRAPH_DECODE.md @@ -175,3 +175,123 @@ second-order floor, not the present bottleneck. No code patch in Phase 1 (graphs are not the lever, so there is no paged graph-capture patch to land). Evidence: `~/bench/a2_4cell_v2/`, `~/bench/a2_probe`, `~/bench/a2_probe2`, `~/bench/a2_nsys/*.nsys-rep` on the DGX. + +# Phase 2 - the real decode lever, located (per-kernel decomposition) + +Phase 1 ended on "decode is GPU-compute-bound; the 2.6x gap to vLLM lives in the +per-step GPU kernel work (FP4 GEMM + attention at batch 128)." Phase 2 measured +that per-step GPU work directly - per kernel and per memcpy, on the Phase-1 nsys +`.sqlite` reps - and the "FP4 GEMM + attention" attribution does not survive the +measurement. Two corrections, then the lever. + +The conditional Phase 2 fix (make the paged decode graph-capturable) is moot: +Phase 1 already showed the default paged decode captures, and the fresh re-check +below reconfirms the graph win is noise. Neither Phase 2 branch (within-step graph +fix / between-step host loop) is the lever; the lever is a third thing, measured +here. + +## Fresh re-confirmation: graphs are not the lever + +Independent run (npl128, ntg32, paged, fusion off), not reusing Phase 1's table: + +| paged decode | S_TG t/s | vs vLLM 391 | +|---------------|----------|-------------| +| graphs ON | 146.03 | 37.3% | +| graphs OFF | 144.90 | 37.1% | + ++0.78%, within noise - same verdict as Phase 1's 4-cell. The ON nsys rep is also +99.5% busy with the same ~3267 ms of memcpy as OFF: graphs capture the memcpy +nodes too, so they cannot remove either the copies or the compute. + +## Correction 1: the model is a hybrid SSM, not a plain transformer + +`q36-27b-nvfp4.gguf` has `general.architecture = qwen35` with +`qwen35.ssm.{conv_kernel,state_size,group_count,time_step_rank,inner_size}`. The +decode-window kernel cadence (per step, ~19.8 steps in the window) is 48 +`gated_delta_net_cuda` + 48 `ssm_conv_f32` vs 16 `flash_attn_tile`, i.e. **48 +gated-DeltaNet linear-attention layers : 16 full-attention layers** (a 3:1 +hybrid, Qwen3-Next family). Paged attention only touches the 16 full-attention +layers. + +## Correction 2: the 99.4% "busy" is ~19% D2D memcpy, not compute + +Interval-union sweep over the steady decode window (last 17 s of the npl128/ntg24 +OFF rep; single CUDA stream; running-max-end so it is overlap-correct): + +| activity set | GPU busy | idle | +|------------------------|----------|-------| +| kernels only | 80.2% | 19.8% | +| kernels + memcpy (all) | 99.4% | 0.6% | + +The 969 inter-kernel gaps (>=1 ms, ~48/step) that drop kernels-only to 80% are +filled by **D2D memcpy: 1584 copies/run (~80/step), ~230 MB each, ~2 ms each, +356 GB moved in 17 s**. At batch 128 a ~230 MB block is the gated-DeltaNet +recurrent state; these are the per-SSM-layer state copies. (HtoD copies = the +paged block-table/index upload: 731/run but only 3 ms total, negligible; DtoH +47 ms.) Phase 1's `cuda_gpu_trace`-based 99.4% counted these memcpys as "busy" +and lumped them into "GPU kernel compute" - they are memory movement, and they +are addressable. + +## Decode GPU-time decomposition (% of kernel+memcpy busy) + +OFF/eager rep, steady window. `/step` = instances per decode step. + +| share | activity | /step | role | +|-------|-----------------------------------|-------|-------------------------------| +| 23.4% | gated_delta_net_cuda | 48 | linear-attn recurrence | +| 21.9% | k_get_rows_float | 97 | SSM state / conv-state gather | +| 18.9% | MEMCPY DtoD | 80 | SSM recurrent-state copy | +| 15.5% | mul_mat_vec_q (nvfp4, ncols=1) | 48 | FP4 GEMV | +| 10.4% | mul_mat_q (nvfp4) | 352 | FP4 GEMM | +| 1.9% | quantize_mmq_nvfp4 | 448 | act requant for MMQ | +| 1.0% | concat_cont | 48 | SSM state glue | +| 0.8% | ssm_conv_f32 | 48 | SSM short conv | +| 0.7% | unary_gated_op silu | 112 | SSM gating | +| 0.4% | flash_attn_tile/_ext | 16 | FULL attention (paged) | + +Grouped: +- gated-DeltaNet / SSM machinery (recurrence + get_rows gather + DtoD state copy + + conv + gating glue): **~67% of decode**. +- FP4 matmul (GEMV + GEMM + requant + stream-k fixup): **~28%**. +- Full attention - everything paged attention optimizes: **~0.4%**. + +## Verdict and scope of the real lever + +1. CUDA graphs: not the lever (Phase 1, re-confirmed: +0.78%, noise). They capture + the memcpy too, so they cannot touch the copies or the compute. +2. Host loop: not the lever (true host idle in the union is 0.24%, ~41 ms/17 s). +3. FP4 GEMM: secondary, ~28%. Consistent with Track B P2a (making the FP4 GEMM 26% + faster left decode_agg flat) - it was never the long pole. +4. Paged / full attention: ~0.4% of decode. **No paged-attention change (graphs, + block-table stabilization, gather rewrite) can move decode_agg on this model** + - it optimizes under half a percent of the step. This is the structural reason + A.2, and the paged-decode track generally, cannot close the vLLM gap on + q36-27b: the model barely uses the path being optimized. + +The throughput lever is the ggml **qwen35 gated-DeltaNet decode**. Per SSM layer +per step it re-materializes and D2D-copies the full recurrent state (~230 MB at +batch 128; ~80 copies/step, ~18 GB/step) and feeds the recurrence through ~2 +`get_rows` gathers, so ~61% of decode (state copy + state gather + recurrence) is +SSM state plumbing. vLLM's gated-DeltaNet decode (the flash-linear-attention +`fused_recurrent_gated_delta_rule` path) keeps the state in place and fuses the +gather into the scan, avoiding both the per-layer D2D copy and the gathers. + +Next-step scope (the real lever, to be done in the ggml/llama qwen35 SSM path - +not paged-attn, not a graph capture, not a block-table tweak): +1. Eliminate the per-layer recurrent-state D2D copy: update the state tensor + in place (or double-buffer / write-back), so the recurrence consumes and + produces the persistent state without a full-state copy each layer each step. +2. Fuse the `get_rows` state / conv-state gather into the recurrent kernel. + +Ceiling from this rep (upper bound; assumes the work is fully removed, not just +overlapped): +- remove the DtoD state copy: reclaim 18.9% -> ~146 to ~180 t/s. +- remove copy + gather: reclaim ~41% -> ~146 to ~247 t/s, which puts llama within + ~1.6x of vLLM 391 with the FP4 GEMM still untouched. + +No code patch in Phase 2 either: the lever is a gated-DeltaNet decode rewrite in +the SSM path, too large for this measurement pass and orthogonal to paged +attention. `patches/paged/0018` stays free. Evidence on the DGX: +`~/bench/a2_decompose/decode_decomp.txt` (per-kernel table + reproducing SQL in +its header), `~/bench/a2_decompose/SUMMARY.txt`, and the Phase-1 reps +`~/bench/a2_nsys/paged_off_npl128.sqlite` / `paged_on_npl128_node.sqlite`.