diff --git a/backend/cpp/llama-cpp/patches/paged/A2_CUDAGRAPH_DECODE.md b/backend/cpp/llama-cpp/patches/paged/A2_CUDAGRAPH_DECODE.md new file mode 100644 index 000000000..7f8312773 --- /dev/null +++ b/backend/cpp/llama-cpp/patches/paged/A2_CUDAGRAPH_DECODE.md @@ -0,0 +1,177 @@ +# A.2 - CUDA-graphing the paged decode: measured lever + gap diagnosis + +Phase 1 (measure, do not punt). DGX GB10 (sm_121), dev tree `~/llama-paged-dev` +HEAD 089f78d (patch 0017), `build-cuda`. Model `q36-27b-nvfp4.gguf` (dense), +harness `llama-batched-bench`, fusion held OFF (`LLAMA_FUSE_NVFP4_QUANT=0`) for a +clean stock-kernel baseline. `decode_agg` = the `S_TG t/s` column. + +## TL;DR verdict + +CUDA-graphing the paged decode is **NOT a real throughput lever** (ceiling well +under 1%). The steady decode step is **GPU-compute-bound: 99.4-99.5% GPU-busy**. +Total GPU idle is ~0.5-0.6% of the step, split into within-step launch gaps +(0.37%, the only thing CUDA graphs remove) and a between-step host-loop gap +(0.24%, one ~2 ms gap per step). Graphs already engage on the default paged +decode and do collapse the launch gaps (0.37% -> 0.11%), but the GPU stays +99.4-99.5% busy either way, so decode_agg is unchanged. The 2.6x gap to vLLM +(148 vs 391) lives in the per-step GPU **kernel work** (FP4 GEMM + attention at +batch 128), not in launch overhead or the host loop. + +The premise that "the paged decode runs eager (graphs reused=0)" did not survive +measurement: at the benchmarked context the default paged decode captures and +replays graphs exactly like stock non-paged. Two measurement traps (below) +explain the earlier "reused=0 / gap-bound" reading. + +## Method note: a graph-enable trap that was corrected + +`GGML_CUDA_DISABLE_GRAPHS` is read with `getenv(...) != nullptr` +(`ggml/src/ggml-cuda/common.cuh:1234`), so setting it to an **empty** string +still disables graphs. A first 4-cell pass that used +`GGML_CUDA_DISABLE_GRAPHS=""` for the "graphs ON" cells therefore ran graphs OFF +in all four cells (an OFF-vs-OFF comparison). The numbers below ("v2") unset the +variable with `env -u` for the ON cells. The `-lv 99` probe is unaffected (it +never set the variable). + +## Step 1 - the 4-cell decode_agg table (corrected, graphs genuinely enabled) + +npp 128, ntg 128, npl 32 and 128, c 40960, b/ub 2048, fa on. `S_TG t/s`: + +| cell | npl 32 | npl 128 | +|------------------|---------|---------| +| stock_graphon | 116.47 | 148.41 | +| stock_graphoff | 115.17 | 148.21 | +| paged_graphon | 116.21 | 148.60 | +| paged_graphoff | 114.62 | 147.65 | + +ON vs OFF (the graph win): + +| config | npl 32 | npl 128 | +|--------|--------|---------| +| stock | +1.13% | +0.13% | +| paged | +1.39% | +0.64% | + +- (a) Does STOCK get a graph win? Essentially no: +0.13% at npl 128, +1.13% at + npl 32 (small-batch, where per-kernel launch overhead is relatively larger). + All within run-to-run noise (~1% at npl 32, ~0.2% at npl 128). +- (b) Does PAGED get a graph win? Same picture: +0.64% / +1.39%. Paged is NOT + eager at this config (see Step 2); it captures graphs like stock. +- (c) LEVER SIZE (proxy = stock graph win, now genuinely measured): +0.13% at + npl 128, +1.1% at npl 32. Negligible vs the 2.6x (=+164%) gap to vLLM. + +All four cells sit at ~148 (npl 128) / ~115 (npl 32) within ~1%. The ~148 wall is +shared by stock and paged; it is not paged-specific. Calibration cross-check +(paged ON, ntg 64): 147.64, matching the reference 148-149. + +## Step 2 - why the "eager" premise is wrong, and what actually mutates + +CUDA-graph state machine (`ggml_backend_cuda_graph_compute` in +`ggml/src/ggml-cuda/ggml-cuda.cu`): warmup completes after a step whose node +properties did not change vs the previous step; any later change logs +`CUDA graph warmup reset` and reverts to eager until stable again. +`ggml_cuda_graph_update_required` memcmps every node's full `ggml_tensor` plus +each src's `data` ptr / `ne` / `nb`. + +`-lv 99` probe, short context (npp 64, ntg 32, ctx <= 96): +- stock: `warmup complete` x2, `warmup reset` x0. +- paged: `warmup complete` x2, `warmup reset` x0. +Both capture and then replay silently. The `CUDA Graph id N reused` line stays 0 +for both because llama rebuilds the cgraph each ubatch (new `cgraph->uid`), so +the uid fast-path never fires; the graph is still replayed via the +`instance != nullptr` path, which logs nothing. **"reused=0" is a false negative, +not evidence of eager execution.** (Trap #1.) + +Cadence probe (npp 200, ntg 320, npl 4, ctx 200->520, crosses the 256 and 512 +token boundaries), counts over ~320 decode steps: + +| path | complete | reset | interpretation | +|-------------------------------|----------|-------|-------------------------------| +| paged in-kernel (default) | 10 | 8 | resets only at 256-boundaries | +| paged gather (KV_PAGED_GATHER)| 0 | 0 | never captures -> pure eager | +| stock non-paged | 10 | 8 | identical 256-cadence | + +The 8 resets cluster at the two boundary crossings (timestamps ~9.9 s and ~34 s), +not per-step. The default paged decode is therefore captured for ~97% of steps, +re-warming only every ~256 tokens, with the **same cadence as stock**. + +What mutates (the block-table / gather input): +- in-kernel decode (default): the block-table graph input + `idx = ggml_new_tensor_2d(ctx0, I32, n_view, n_stream)` with + `n_view = GGML_PAD(n_gather, 256)` (`src/paged-attn.cpp:199,213`). Its `ne[0]` + steps 256 -> 512 -> 768 only when the context crosses a 256-token boundary. The + kq_mask input (ne0 = n_kv, also padded to 256) steps in lockstep. So the + property change is per-256-tokens, not per-step. +- gather fallback (`LLAMA_KV_PAGED_GATHER=1`, transposed-V, or prefill): the + index input `idx = ggml_new_tensor_2d(ctx0, I32, n_gather, n_stream)` + (`src/paged-attn.cpp:106`) has `ne[0] = n_gather` (UNPADDED), which grows every + step (the unit's own comment, `src/paged-attn.cpp:28-30`: "n_gather grows every + step"). That changes a node property every step, warmup never completes, and + the path runs pure eager. This is the only "graphs reused=0" path, and it is + not the default decode path. + +`LLAMA_KV_PAGED_DEBUG` dump at ctx 201 (first 2 decode calls, identical across +the pair): `in-kernel decode n_stream=4 n_kv=256 n_gather=201` -> block-table +`ne[0] = GGML_PAD(201,256) = 256`, stable until n_gather crosses 256. + +## Step 3 - where the step time goes (nsys), and a second trap + +npl 128, ntg 24, ctx 56 (< 256, so the ON run stays captured after warmup). +Idle split by gap size: within-step launch gaps < 1 ms, between-step host gaps +>= 1 ms. Steady window = 40%-97% of the trace span (excludes model load / graph +reserve / prefill one-offs). + +Trap #2: `nsys --trace=cuda` does NOT emit the kernels INSIDE a replayed CUDA +graph into `cuda_gpu_trace` by default. The graphs-ON trace had only 15,279 GPU +rows vs 84,946 for the identical OFF workload and reported a bogus 0.3% GPU-busy. +Re-profiling the ON case with `--cuda-graph-trace=node` restored all 84,946 rows +and 99.5% busy. **Any "decode is idle/gap-bound" reading taken from a graphs-ON +nsys trace without `--cuda-graph-trace=node` is artifactually idle-inflated** - +the likely source of the earlier "freed GPU time became idle gaps" conclusion. + +Reliable steady-state numbers: + +| trace | GPU rows | busy | within-step idle | between-step idle | host gap/step | +|--------------------------------|----------|--------|------------------|-------------------|---------------| +| OFF (eager) | 84,946 | 99.4% | 0.37% | 0.24% | ~2.0 ms | +| ON (captured, node-trace) | 84,946 | 99.5% | 0.11% | 0.38% | ~1.9 ms | + +- CUDA graphs replay (cudaGraphLaunch=46) and collapse the launch path: ON has + ~15k kernel launches/run vs OFF ~80k (cudaLaunchKernel 6,024 vs 31,764, plus + ExC 9,049 vs 48,165). That cuts within-step launch idle from 0.37% to 0.11%. +- But the GPU is 99.4-99.5% busy in both, so decode_agg is unchanged. +- Between-step host idle is one ~2 ms gap per decode step (the 128-way sample + + update_slots + batch build), 0.24-0.38% of the ~896 ms step. + +Per-step decomposition at npl 128: ~896 ms/step, of which ~890 ms is GPU kernel +compute, ~2 ms host-loop gap, ~3 ms (eager) / ~1 ms (captured) launch gaps. + +## The load-bearing question, answered + +Within-step or between-step? **Neither is large.** The steady decode is 99.4% +GPU-busy; the entire idle budget is ~0.6% of the step. CUDA graphs already remove +the within-step launch fraction (0.37% -> 0.11%), and the between-step host gap is +~2 ms/step (0.24%). There is no large gap for a host-loop rewrite to reclaim +either; the host loop is currently **hidden under GPU compute** (the GPU stays +busy while the host syncs/schedules). It would only become a lever once the +kernels are fast enough to drop GPU-busy below the host time, i.e. it is a +second-order floor, not the present bottleneck. + +## Verdict + +1. CUDA-graphing the paged decode is not the lever. Graphs already engage on the + default decode; capturing reduces within-step launch idle from 0.37% to 0.11% + but leaves the GPU 99.4-99.5% busy, so decode_agg moves by ~0% (measured + +0.1% to +0.6% at npl 128, +1.1% to +1.4% at npl 32, all within noise). +2. The between-step host loop is not the present lever either (0.24%, ~2 ms/step, + hidden under GPU compute). It is the candidate floor only after the kernels + speed up. +3. The decode is GPU-compute-bound at this NVFP4 fusion-OFF baseline. The 2.6x + gap to vLLM is in the per-step GPU kernel work (FP4 GEMM + attention at batch + 128). That, not graphs and not the host loop, is the throughput lever. +4. Corrected premises: paged is not perpetually eager (it captures with a + 256-token reset cadence identical to stock); "graphs reused=0" was a uid + fast-path false negative; and a graphs-ON nsys trace under-counts GPU-busy + unless `--cuda-graph-trace=node` is set. + +No code patch in Phase 1 (graphs are not the lever, so there is no paged +graph-capture patch to land). Evidence: `~/bench/a2_4cell_v2/`, `~/bench/a2_probe`, +`~/bench/a2_probe2`, `~/bench/a2_nsys/*.nsys-rep` on the DGX.