docs(paged): ground vLLM 0.23.0 eager-decode architecture vs llama.cpp

Decompose vLLM's enforce_eager decode step (attention / weight GEMM / sampling / host loop) on GB10 (DGX Spark, sm_121) and attribute the measured ~2.4x NVFP4 decode-throughput gap to its parts, from source reading plus the existing nsys decode trace and H2H bench logs.

Key finding: the gap is dominantly a KERNEL-efficiency gap (~80-90%), not a host-overhead gap. llama's GPU is already ~94.6% busy during steady decode, so a CUDA-graphed decode is a minority lever (~10-20% of the gap, bounded by the GPU-idle bubble), not the silver bullet.

vLLM's wins: in-kernel paged-decode read (no gather tax), faster long-context attention, fused native-FP4 / grouped-Marlin GEMM, and O(1)-in-ctx GDN linear-attention layers on these Qwen3.6 hybrids. vLLM achieved 2.4x with synchronous scheduling and no CUDA graphs.

Evidence: vllm 0.23.0 source (gpu_model_runner, flash_attn/gdn backends, modelopt/marlin GEMM, v1/sample), reproduced nsys kernel categorization (cat2.py), and QWEN36_NVFP4_BENCH / DECODE_GAP_STUDY / CONTINUOUS_BATCH_SCHEDULER_SCOPE.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-24 08:38:51 -04:00 · 2026-06-24 07:44:07 +00:00
parent 5a38dd3f09
commit fccbb4082d
1 changed files with 315 additions and 0 deletions
--- a/backend/cpp/llama-cpp/patches/paged/VLLM_DECODE_GROUNDING.md
+++ b/backend/cpp/llama-cpp/patches/paged/VLLM_DECODE_GROUNDING.md
@@ -0,0 +1,315 @@
+# vLLM 0.23.0 eager-decode grounding: where the ~2.4x decode gap to llama.cpp comes from
+
+Source-reading + grounding only (no GPU, no benchmarking, no llama code changes). This
+decomposes vLLM 0.23.0's per-decode-step work in `enforce_eager` mode and attributes the
+measured ~2.4x decode-throughput gap on GB10 (DGX Spark, sm_121) to its parts, so the
+throughput thread can decide what llama.cpp would actually need (CUDA-graphed decode vs new
+kernels) before anyone touches a kernel.
+
+Hardware: NVIDIA GB10 / DGX Spark, sm_121 (CC 1210 = `GGML_CUDA_CC_DGX_SPARK`), unified
+LPDDR5x ~273 GB/s. vLLM install read: `/home/mudler/vllm-bench/lib/python3.12/site-packages/vllm/`
+(on `dgx.casa`, read-only). Evidence: engine logs `~/bench/h2h_dense_vllm.log`,
+`~/bench/h2h_moe_vllm.log`; nsys decode trace `~/bench/decode_study/srv_decode2.sqlite`
+(reproduced here via `cat2.py`); committed `QWEN36_NVFP4_BENCH.md`, `DECODE_GAP_STUDY.md`,
+`CONTINUOUS_BATCH_SCHEDULER_SCOPE.md`.
+
+## TL;DR (the evidence-based answer)
+
+At batch ~128, ~1024 ctx, NVFP4, `enforce_eager` (no CUDA graphs on either side), vLLM decodes
+~2.4x faster than llama.cpp. Decomposed:
+
+1. **The gap is dominantly a KERNEL-efficiency gap, not a host-overhead gap.** The strongest
+   single datum: during steady llama decode the GPU is **~94.6% busy** (nvidia-smi, real run) and
+   85.5% busy in the nsys window (`DECODE_GAP_STUDY.md`; nsys profiling itself adds gaps). A CUDA
+   graph only removes host/launch overhead, so it can recover at most the exposed idle bubble:
+   ~5% of the step by the nvidia-smi figure, ~15% by the more pessimistic nsys figure.
+   **CUDA-graphing llama's decode is therefore a minority lever: ~5-15% of the step, i.e. roughly
+   10-20% of the gap.** The remaining ~80-90% is the GPU spending its busy time in kernels that
+   are simply slower per unit work than vLLM's.
+
+2. **vLLM's eager decode step is cheap on the host by construction**, so its host time is small
+   to begin with and hides behind the async CUDA stream: persistent pre-allocated input buffers
+   updated with vectorized numpy (no per-token Python), attention metadata built once per step and
+   shared across all layers, no GPU->CPU sync in the hot path, and a fixed small kernel-launch
+   sequence per layer (2 ops per Linear, 2 grouped Marlin launches for *all* MoE experts).
+   `async_scheduling` was **off** in this run (absent from both engine logs; default resolves to
+   the synchronous `Scheduler`, `config/scheduler.py:168-176`), so vLLM achieved the 2.4x with
+   *synchronous* per-step scheduling. The host advantage is structural, not pipelining.
+
+3. **Where vLLM's kernels win:** (a) attention reads paged KV **in-kernel** via a block table in
+   one batched `flash_attn_varlen_func` launch, with **no gather/copy** (vLLM never pays llama's
+   paged `get_rows` + `cpy` tax, which is ~36% of llama's *paged* step); (b) the dense NVFP4 GEMM
+   is a **native FP4-MMA cutlass** kernel with the activation-quant **fused** into the preceding
+   RMSNorm/SiLU (no standalone `quantize_mmq` requant pass); (c) the MoE experts are **one grouped
+   Marlin kernel per projection for all experts** (W4A16, in-kernel dequant); (d) on these Qwen3.6
+   models a fraction of layers are **GDN linear-attention** whose decode is an **O(1)-in-context
+   recurrent state update**, not an O(ctx) KV read.
+
+4. **Sampling is not the gap** on either side: vLLM samples all ~128 sequences with a handful of
+   batched on-GPU kernels (FlashInfer), so greedy and a heavy sampler chain cost the same. This
+   mirrors llama's own finding (`DECODE_GAP_STUDY.md`: greedy 1343 ms == 5-sampler 1346 ms).
+
+## The measured gap (apples-to-apples, both eager)
+
+From `QWEN36_NVFP4_BENCH.md` (matched NVFP4 weights, one GB10 box, vLLM 0.23.0
+`--enforce-eager`, llama patch 0015 + budget-256), decode aggregate tok/s at npl128:
+
+| model | llama (best) | vLLM | ratio | per-step (128 tok) llama -> vLLM |
+|-------|-------------:|-----:|------:|----------------------------------|
+| DENSE Qwen3.6-27B | 161.2 | 390.7 | **2.42x** | ~795 ms -> ~328 ms |
+| MoE Qwen3.6-35B-A3B | 333.5 | 811.1 | **2.43x** | ~384 ms -> ~158 ms |
+
+Both models converge to ~41% of vLLM at npl128 after llama's prefill starvation is removed
+(patch 0013), and at npl8 the two engines are at or near parity (dense 99%, MoE 84% of vLLM). So
+the residual ~2.4x is a steady-state decode property at high batch, not a prefill or scheduler
+artifact. The scheduler was separately shown not to be the lever: a clean all-128-decoding run
+still tops out at 157-161 dense / 333 MoE (`CONTINUOUS_BATCH_SCHEDULER_SCOPE.md`).
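The per-step column and the ratios in the table follow directly from the aggregate tok/s figures. A minimal sketch of that arithmetic, assuming each decode step emits exactly 128 tokens (one per sequence); the helper name `step_ms` is illustrative and does not come from the bench files:

```python
# Decode aggregate tok/s at npl128, copied from the table above.
BATCH = 128  # assumption: one token per sequence per decode step

bench = {
    "dense Qwen3.6-27B": {"llama": 161.2, "vllm": 390.7},
    "MoE Qwen3.6-35B-A3B": {"llama": 333.5, "vllm": 811.1},
}


def step_ms(tok_per_s: float, batch: int = BATCH) -> float:
    """Wall time in ms of one decode step that emits `batch` tokens."""
    return 1000.0 * batch / tok_per_s


def ratio(row: dict) -> float:
    """vLLM throughput divided by llama throughput."""
    return row["vllm"] / row["llama"]


for name, row in bench.items():
    r = ratio(row)
    print(
        f"{name}: {step_ms(row['llama']):.0f} ms -> {step_ms(row['vllm']):.0f} ms, "
        f"ratio {r:.2f}x, llama at {100 / r:.0f}% of vLLM"
    )
```

Working in per-step milliseconds rather than tok/s is convenient because the later kernel-bucket shares are fractions of step time, not of throughput.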
+
+## Confirmed configuration (both sides eager, no CUDA graphs)
+
+vLLM, both models (engine logs):
+- `enforce_eager=True`, `CompilationMode.NONE`, `cudagraph_mode=<CUDAGraphMode.NONE>`:
+  `"Enforce eager set, disabling torch.compile and CUDAGraphs ... -cc.mode=none
+  -cc.cudagraph_mode=none"`, `"Cudagraph is disabled under eager mode"`. So no torch.compile, no
+  inductor, no graph capture: the model runs as pure eager dispatch of custom ops.
+- Attention: `"Using FLASH_ATTN attention backend out of ['FLASH_ATTN','FLASHINFER','TRITON_ATTN',
+  'FLEX_ATTENTION']"`, `"Using FlashAttention version 2"`.
+- Dense weight GEMM: `"Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM"` (native W4A4
+  cutlass FP4-MMA), `"Enabled custom fusions: norm_quant, act_quant"`, FlashInfer autotuned the
+  `fp4_gemm` (16 configs) at startup.
+- MoE weight GEMM: `"Using 'MARLIN' NvFp4 MoE backend out of ['FLASHINFER_TRTLLM',...,'MARLIN',
+  'EMULATION']"` with `"Your GPU does not have native support for FP4 computation ... Weight-only
+  FP4 compression will be used leveraging the Marlin kernel"` (so MoE experts = W4A16 weight-only
+  Marlin: in-kernel dequant + bf16 MMA), plus `"FlashInferFP8ScaledMM"` for the FP8 attention
+  linears.
+- Both models are **hybrid GDN**: `"Using Triton/FLA GDN prefill kernel"` and `"Setting attention
+  block size to 784/1056 tokens to ensure attention page size >= mamba page size"` (dense 784, MoE
+  1056). A decode-time `fused_recurrent_gated_delta_rule_packed_decode_kernel` is JIT-compiled.
+- Sampling: `"Using FlashInfer for top-p & top-k sampling."`
+- `async_scheduling` not present in either log -> synchronous `Scheduler`.
+
+llama side (the brief's premise, corroborated by `CONTINUOUS_BATCH_SCHEDULER_SCOPE.md` review):
+`-fa on`, paged KV, eager (no engaged CUDA graphs at batched decode). The `DECODE_GAP_STUDY.md`
+nsys run explicitly set `GGML_CUDA_DISABLE_GRAPHS=1` to match.
+
+## Decomposition of vLLM's eager decode step
+
+All file paths below are under
+`/home/mudler/vllm-bench/lib/python3.12/site-packages/vllm/`. The driver is
+`v1/worker/gpu_model_runner.py::execute_model` (line 4005): host preprocess under
+`synchronize_input_prep()`, then `_model_forward` under `set_forward_context`, then `compute_logits`;
+sampling is a separate `sample_tokens` (line 4357). Under eager, `_determine_batch_execution_and_padding`
+(line 3768) dispatches `CUDAGraphMode.NONE`, and `_model_forward` (line 3718) just calls
+`self.model(...)` directly: no capture, no replay, same code every step.
+
+### (a) Attention - one batched in-kernel paged-decode launch + O(1) GDN layers
+
+- **Full-attention layers (FA2):** `v1/attention/backends/flash_attn.py`. `FlashAttentionImpl.forward`
+  (667-848) issues **one** `flash_attn_varlen_func` (796-818) over all ~128 decode tokens, passing
+  `key_cache`/`value_cache` (the raw paged block pools, **not gathered**), `cu_seqlens_q`,
+  `seqused_k`, and **`block_table=attn_metadata.block_table`**. The kernel walks the block table to
+  fetch each sequence's KV pages directly. In-kernel paged read confirmed: there is **no gather/copy**
+  in the Python layer; the only KV write is `reshape_and_cache_flash` (a scatter of the new token via
+  `slot_mapping`). FA2 disables vLLM's AOT host scheduler (`aot_schedule = (fa_version==3)` is False,
+  333), so `schedule()` returns `None` (445-469): the per-step metadata `build()` (388-575) is **pure
+  reference/scalar assembly**, no Python loop over the 128 sequences, no host scheduling, no sync.
+- **Built once per step, reused across layers:** `supports_update_block_table=True` (300); the first
+  full-attn layer calls `build()`, every later layer reuses it via `update_block_table()` (577-586,
+  a `copy.copy`). So `build()` runs **once per decode step** for the whole KV group, not per layer.
+- **GDN linear-attention layers (the hybrid half):** `model_executor/layers/mamba/gdn/
+  qwen_gdn_linear_attn.py`, kernels in `model_executor/layers/fla/ops/fused_recurrent.py`. Pure decode
+  takes `_forward_core_decode_non_spec` (1644-1696): two state-update kernels only -
+  `causal_conv1d_update` + `fused_recurrent_gated_delta_rule_packed_decode` (Triton kernel 255-336,
+  grid `(NV, B*HV)` = one batched launch over all 128 rows). Each program updates a **fixed-size
+  [K,V] recurrent state** (`b_h *= exp(g); b_h += (beta*(v - h.k)) outer k; o = h.q`) - **no loop over
+  the 1024 past tokens, no KV read.** This is **O(1) in context length**, while FA2 streams ~ctx KV
+  per head per row. On these Qwen3.6 models the GDN layers make a chunk of the decode cost flat in
+  ctx, a structural saving llama.cpp only gets if it also runs these layers as a recurrent state
+  update (see item 5 of the ranked list).
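The recurrent update quoted in the GDN bullet can be written out for a single head in plain Python. This is a sketch of the formula as stated in the text (`b_h *= exp(g); b_h += (beta*(v - h.k)) outer k; o = h.q`), not a port of the Triton kernel: query scaling, normalization, batching over rows and heads, and the packed layout are all omitted, and the function name `gdn_decode_step` is hypothetical.

```python
import math


def gdn_decode_step(h, q, k, v, g, beta):
    """One gated-delta-rule decode step for a single head.

    h     : K x V recurrent state (list of K rows of V floats), updated in place
    q, k  : length-K vectors; v: length-V vector
    g     : log-space decay gate; beta: write strength
    Returns the length-V output. The cost is O(K*V) and does not depend on
    how many tokens came before.
    """
    K, V = len(h), len(h[0])

    decay = math.exp(g)
    for i in range(K):  # b_h *= exp(g)
        for j in range(V):
            h[i][j] *= decay

    # What the decayed state currently predicts for this key: (h.k)[j]
    pred = [sum(h[i][j] * k[i] for i in range(K)) for j in range(V)]
    # Delta rule: write only the part of v the state does not already predict
    delta = [beta * (v[j] - pred[j]) for j in range(V)]

    for i in range(K):  # b_h += outer(k, delta)
        for j in range(V):
            h[i][j] += k[i] * delta[j]

    # o = h.q
    return [sum(h[i][j] * q[i] for i in range(K)) for j in range(V)]


# Usage: the state stays K x V no matter how many steps are run.
state = [[0.0, 0.0], [0.0, 0.0]]
out1 = gdn_decode_step(state, q=[1.0, 0.0], k=[1.0, 0.0], v=[3.0, 4.0], g=0.0, beta=1.0)
out2 = gdn_decode_step(state, q=[1.0, 0.0], k=[1.0, 0.0], v=[3.0, 4.0], g=0.0, beta=1.0)
```

With no decay (`g = 0`) and `beta = 1`, the second call finds that the state already predicts `v` for this key, so the delta is zero and the state is left unchanged. Nothing in the step reads past tokens, which is the property that makes these layers flat in context length.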
+
+### (b) Weight GEMM - native FP4-MMA (dense) / grouped Marlin (MoE), M-batched, fused quant
+
+- **Dense NVFP4 linear:** `model_executor/layers/quantization/modelopt.py::ModelOptNvFp4LinearMethod.apply`
+  (1226-1232) -> `model_executor/kernels/linear/nvfp4/flashinfer.py::apply_weights` (56-89): exactly
+  two GPU ops - `scaled_fp4_quant` (activation -> packed FP4 + blockscale) then
+  `flashinfer_scaled_fp4_mm` (the autotuned `fp4_gemm`, a **native W4A4 cutlass FP4-MMA** whose
+  **dequant is fused into the MMA epilogue** via the precomputed `alpha = in_gscale*w_gscale`). The
+  activation-quant is itself folded away: `compilation/passes/fusion/rms_quant_fusion.py:98`
+  (`norm_quant`: RMSNorm -> `scaled_fp4_quant` fused) and `act_quant_fusion.py:40,128`
+  (`act_quant`: SiLU+mul -> FP4 fused). **There is no standalone full-tensor requantize pass** like
+  llama's `quantize_mmq`, and the weight is never dequantized to a temp buffer.
+- **MoE experts (Marlin W4A16):** `model_executor/layers/fused_moe/experts/marlin_moe.py`.
+  `fused_marlin_moe` (227) does **one** `moe_align_block_size` token-sort then `_fused_marlin_moe`
+  (59) issues **exactly two grouped kernels** - `moe_wna16_marlin_gemm` for gate_up (137) and for
+  down (194) - **each a single launch covering ALL experts** (it walks `expert_ids`/`sorted_token_ids`
+  internally; no Python loop over experts), with a `silu_and_mul` between and a `moe_sum` reduce
+  after. W4A16 means weights are dequantized in-kernel and activations stay bf16 (never requantized).
+- **Decode-M batching (the key throughput property):** the dense GEMM reshapes activations to (M, K)
+  with M = total decode tokens (~128) and reads each FP4 weight **once for all 128 tokens**; the MoE
+  grouped GEMM reads each routed expert's weight **once** for the ~M*topk/E tokens routed to it. At
+  M~128 with FP4 weights these GEMMs are weight-read (memory) bound, so the GB10's ~273 GB/s LPDDR5x
+  sets the floor, but the bytes are amortized over the whole batch. This is the ideal case and it is
+  the same regime llama is in, so the GEMM gap is kernel efficiency (fused quant + native FP4 MMA),
+  not a batching defect.
+- **Host cost per layer (eager):** each `Linear.apply()` dispatches at most 2 `torch.ops` kernels; a
+  dense layer's GEMM+norm/act portion is ~7-11 launches, a MoE expert block is ~5-6 launches **for all
+  experts combined** (expert count does not multiply launches). Fixed, small, no per-tile/per-expert
+  Python.
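The amortization argument in the decode-M bullet reduces to one division. The sketch below assumes the ~19 GB/step weight-read figure that this document quotes in its ranked list and the ~273 GB/s LPDDR5x bandwidth from the hardware line. It treats the weight read as the only traffic, so it is a lower bound on step time, not a prediction. The function names are illustrative.

```python
WEIGHT_BYTES_PER_STEP = 19e9   # ~19 GB read per decode step (figure quoted in this document)
BANDWIDTH_BYTES_PER_S = 273e9  # GB10 unified LPDDR5x, ~273 GB/s


def weight_read_floor_ms(weight_bytes: float = WEIGHT_BYTES_PER_STEP,
                         bandwidth: float = BANDWIDTH_BYTES_PER_S) -> float:
    """Minimum time in ms to stream the weights once, ignoring all other traffic."""
    return 1000.0 * weight_bytes / bandwidth


def per_token_floor_ms(m: int) -> float:
    """Weight-read floor per token when one pass over the weights serves M tokens."""
    return weight_read_floor_ms() / m


floor = weight_read_floor_ms()
for m in (1, 8, 128):
    print(f"M={m:>3}: floor {floor:.1f} ms/step, {per_token_floor_ms(m):.2f} ms/token")
```

The step-level floor is the same at every M; only the per-token share shrinks as M grows. That is why both engines sit in the same bandwidth regime at M~128 and the remaining GEMM difference is attributed to kernel efficiency.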
+
+### (c) Sampling - fully batched on-GPU, negligible
+
+`v1/sample/sampler.py::Sampler.forward` (72) operates on the whole `[num_seqs, vocab]` logits
+tensor: batched `argmax` (greedy, 240) or temperature `div_` + one FlashInfer
+`top_k_top_p_sampling_from_logits` (`v1/sample/ops/topk_topp_sampler.py:493`) + `torch.where`
+(296-301). **No per-sequence Python loop** in the hot path. Per-seq params live as pre-staged GPU
+tensors `temperature/top_p/top_k[num_seqs]` (`v1/worker/gpu_input_batch.py:184-205`), copied once via
+non-blocking H2D and rebuilt only on batch change (`refresh_metadata`, 815-829). Greedy and the full
+chain are the same batched-op class. Sampled-token D2H is async (CUDA-event gated, 243-313);
+detokenization runs on CPU in the async output processor (`v1/engine/output_processor.py`). Sampling
+is a negligible tail and does not stall the GPU loop - exactly as on the llama side.
+
+### (d) Host / Python per-step loop - cheap by construction, hidden behind the async stream
+
+`execute_model` host prep, all incremental on persistent buffers (`_prepare_inputs`, 1872+):
+- `block_table.commit_block_table` started **first** to overlap its copy with following CPU work
+  (1890); each step appends only newly-allocated block ids (`append_row`), usually <=1 at decode.
+- positions / token gather are **vectorized numpy + a single `torch.index_select`** into the
+  pre-allocated `input_ids.cpu` (1928-1939); `query_start_loc`/`seq_lens` set by slice ops
+  (1979-1990). `slot_mapping` is one Triton kernel (`v1/worker/block_table.py`). **No per-token, no
+  per-request Python loop** in the steady decode path.
+- `CommonAttentionMetadata` assembled once (2287-2305), then the attention builder runs once per KV
+  group (see (a)).
+- The forward runs under `set_forward_context(...)` with `cudagraph_runtime_mode=NONE`; `_model_forward`
+  is a direct `self.model(...)`.
+- **No GPU->CPU sync in the hot path:** the sampled-token copy is `non_blocking` + event-gated;
+  `execute_model` returns after launching the forward, and the cheap host prep for the next step
+  overlaps the GPU executing the current step on the async CUDA stream (CUDA launches are
+  non-blocking). `async_scheduling` was off, so this overlap is just ordinary CUDA async, not
+  pipelined scheduling - yet it is enough because the host work is so small.
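The persistent-buffer pattern in the bullets above can be illustrated with a toy block table. This is a simplified, hypothetical stand-in, not vLLM's class: `ToyBlockTable` and `slot` are names invented here, and only `append_row` echoes a name from the text. The real implementation keeps the table in pinned host and device tensors and computes the slot mapping in a Triton kernel.

```python
class ToyBlockTable:
    """Per-request list of physical KV block ids, allocated once and appended to."""

    def __init__(self, max_reqs: int, max_blocks_per_req: int, block_size: int):
        self.block_size = block_size
        # Allocated once up front; steady decode never reallocates these rows.
        self.rows = [[0] * max_blocks_per_req for _ in range(max_reqs)]
        self.num_blocks = [0] * max_reqs

    def append_row(self, req: int, new_block_ids: list) -> None:
        """Record only the newly allocated blocks (usually zero or one at decode)."""
        n = self.num_blocks[req]
        self.rows[req][n:n + len(new_block_ids)] = new_block_ids
        self.num_blocks[req] = n + len(new_block_ids)

    def slot(self, req: int, pos: int) -> int:
        """Flat slot in the KV pool where the token at position `pos` is written."""
        block_id = self.rows[req][pos // self.block_size]
        return block_id * self.block_size + pos % self.block_size


# Usage: request 0 fills physical block 7, then spills into physical block 3.
bt = ToyBlockTable(max_reqs=2, max_blocks_per_req=4, block_size=16)
bt.append_row(0, [7])
first = bt.slot(0, 5)    # inside logical block 0 -> physical block 7
bt.append_row(0, [3])
second = bt.slot(0, 16)  # first token of logical block 1 -> physical block 3
```

Per-step host work is then an append of at most a few integers plus index arithmetic. The attention kernel receives the same table and resolves pages itself, which is why no gather or copy is needed on the read side.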
+
+What llama-server's per-step C++ loop pays that vLLM does not (host side, graph-addressable):
+ggml rebuilds/reallocates the compute graph each decode step and dispatches ~1k kernel launches from
+the loop on the weak Grace ARM cores (`CONTINUOUS_BATCH_SCHEDULER_SCOPE.md` review). vLLM's persistent
+buffers + build-once-reuse metadata + fixed launch sequence are exactly the things that keep its eager
+step host-cheap; llama could borrow these (persistent device KV/block metadata, build the ggml graph
+once and reuse it, zero per-step host sync) to shrink the bubble **without** a full CUDA graph.
+
+## The llama side, for the split (nsys, reproduced)
+
+`~/bench/decode_study/cat2.py` over `srv_decode2.sqlite` (Qwen3-32B dense, pure full-attention, 64
+layers, batch 32, 1024 ctx, paged, eager), reproduced now:
+
+```
+window_span_s 24.960  sum_kernel_s 21.348  gpu_busy_pct 85.5
+ATTENTION (flash_attn_ext_f16) 10.177 s  47.7%
+kv_copy_cast (cpy_*)            3.903 s  18.3%
+embed_gather_rows (get/set)    3.803 s  17.8%   <- the PAGED gather tax
+GEMM_weight (mul_mat)          3.173 s  14.9%
+GEMM_act_quant (quantize_mmq)  0.172 s   0.8%
+rmsnorm/silu/rope/add          ~0.12 s   ~0.6%
+```
+
+So on llama's paged decode step: ~84% is KV/attention (attention 47.7% + KV copy 18.3% + paged
+gather 17.8%), ~16% is weight GEMM, and the host loop is **hidden** (GPU 85-94% busy; greedy ==
+heavy-sampler step time). Mapping each bucket to vLLM:
+
+| llama bucket (paged) | nsys % | vLLM equivalent | vLLM avoids it? |
+|----------------------|------:|-----------------|-----------------|
+| paged KV gather (`get_rows`) | 17.8% | block table read **in-kernel** | **Yes, entirely** (no such op) |
+| KV copy/cast (`cpy_*`) | 18.3% | KV written once into block pool, read in place | Mostly |
+| decode attention (`flash_attn_ext_f16`) | 47.7% | FA2 paged-decode varlen (+ O(1) GDN layers) | Same op, faster kernel; GDN is cheaper still |
+| weight GEMM + act quant | 15.7% | fused native-FP4 / grouped Marlin, no separate requant | Faster + removes the requant kernel |
+| host serving loop / sampling | ~0 (hidden) | cheap persistent-buffer prep, batched GPU sampling | Both hidden; vLLM also cheap |
+
+Note: the nsys decomposition is on **Qwen3-32B (pure attention)**; the 2.4x throughput numbers are on
+**Qwen3.6 hybrid GDN** models. The bucket *shares* differ between the two (GDN shifts work off
+attention), but the lesson - llama's step is GPU-bound on attention + the paged gather + FP4 GEMM,
+with the host hidden - transfers.
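The shares quoted in this section can be rederived from the `cat2.py` totals printed above. The sketch below only repeats that arithmetic. The dictionary keys are shortened labels chosen here, and the small `rmsnorm/silu/rope/add` bucket is left out because its total is approximate.

```python
window_span_s = 24.960
sum_kernel_s = 21.348

# Per-bucket kernel time in seconds, copied from the cat2.py output above.
kernel_s = {
    "attention": 10.177,       # flash_attn_ext_f16
    "kv_copy_cast": 3.903,     # cpy_*
    "paged_gather": 3.803,     # get/set rows
    "gemm_weight": 3.173,      # mul_mat
    "gemm_act_quant": 0.172,   # quantize_mmq
}


def pct(seconds: float, total: float = sum_kernel_s) -> float:
    """Share of total kernel time, in percent."""
    return 100.0 * seconds / total


gpu_busy_pct = 100.0 * sum_kernel_s / window_span_s
kv_attention_pct = pct(kernel_s["attention"] + kernel_s["kv_copy_cast"] + kernel_s["paged_gather"])
gather_tax_pct = pct(kernel_s["kv_copy_cast"] + kernel_s["paged_gather"])
gemm_pct = pct(kernel_s["gemm_weight"] + kernel_s["gemm_act_quant"])

print(f"gpu busy {gpu_busy_pct:.1f}%  KV+attention {kv_attention_pct:.1f}%  "
      f"gather+copy {gather_tax_pct:.1f}%  GEMM {gemm_pct:.1f}%")
```

The percentages are shares of kernel time, not of wall time. Relative to the 24.96 s window, each bucket is smaller by the busy fraction.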
+
+## The split of the 2.4x: kernel vs host (graph-addressable)
+
+Anchored on the measured **~94.6% GPU busy** during steady llama decode (nvidia-smi,
+`DECODE_GAP_STUDY.md`):
+
+- **Host / CUDA-graph-addressable: the minority, ~5-15% of the llama step (=> ~10-20% of the gap).**
+  A GPU that is ~95% busy exposes at most ~5% host idle; a CUDA graph (capture once, replay) removes
+  per-step launch latency + ggml graph rebuild/realloc and can tighten inter-kernel gaps, plausibly
+  recovering ~5-15% of the step in the best case. On llama's ~795 ms dense step that is ~40-120 ms of
+  the ~467 ms gap (795 ms - 328 ms), i.e. ~9-26% of it at the extremes. **A CUDA graph cannot close a
+  2.4x gap**, because the gap is mostly the GPU's busy time, not idle. (The fraction shrinks further
+  at batch 128 vs the nsys batch 32: the per-step launch count is fixed while per-kernel work grows,
+  so host overhead is a smaller share at higher batch.)
+- **Kernel efficiency: the majority, ~80-90% of the 2.4x.** The GPU's busy time goes into kernels that
+  are slower per unit work than vLLM's, decomposed:
+  - **the paged gather regression (~36% of llama's *paged* step; `get_rows`+`cpy`)** - vLLM never pays
+    it because it reads paged KV in-kernel. This is the single biggest discrete, llama-specific,
+    addressable chunk, but removing it only restores llama's own *stock* path; stock is still ~2x off
+    vLLM (`DECODE_GAP_STUDY.md`).
+  - **long-context decode-attention** (the largest residual; attention is ~48% of the step and grows
+    with ctx) - llama's `flash_attn_ext_f16` decode is slower than vLLM's FA2 paged-decode on sm_121,
+    and slower still than the O(1) GDN layers on these models.
+  - **the FP4 weight GEMM floor** (~15-30%) - vLLM fuses the activation-quant into the norm/SiLU and
+    uses native FP4-MMA / grouped Marlin; llama runs `mul_mat_q` + a separate `quantize_mmq` requant.
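The bound on what a CUDA graph can recover is simple enough to check numerically. The sketch below uses the dense step times and the busy fraction quoted in this document, and assumes the best case for graphing: every recovered millisecond is pure idle time and kernel durations do not change. The function names are illustrative.

```python
LLAMA_STEP_MS = 795.0  # dense, npl128
VLLM_STEP_MS = 328.0
GAP_MS = LLAMA_STEP_MS - VLLM_STEP_MS  # 467 ms


def max_speedup_from_idle(gpu_busy_frac: float) -> float:
    """Upper bound on speedup if all GPU idle time vanished and kernels stayed the same."""
    return 1.0 / gpu_busy_frac


def graph_share_of_gap(recovered_frac: float):
    """Return (ms recovered, fraction of the llama-to-vLLM gap) for a step fraction."""
    ms = recovered_frac * LLAMA_STEP_MS
    return ms, ms / GAP_MS


def residual_ratio(recovered_frac: float) -> float:
    """llama/vLLM step-time ratio that remains after the recovery."""
    return LLAMA_STEP_MS * (1.0 - recovered_frac) / VLLM_STEP_MS


print(f"idle-only bound at 94.6% busy: {max_speedup_from_idle(0.946):.3f}x")
for frac in (0.05, 0.15):
    ms, share = graph_share_of_gap(frac)
    print(f"recover {frac:.0%} of step: {ms:.0f} ms, {share:.0%} of gap, "
          f"residual ratio {residual_ratio(frac):.2f}x")
```

Even at the optimistic 15% end the residual ratio stays above 2x, which is the quantitative form of the claim that the kernel buckets listed above carry most of the gap.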
+
+## Ranked list: what llama would need to close the 2.4x, and how much each buys
+
+1. **Do not pay the paged gather at decode. [largest discrete, llama-addressable; ~36% of the paged
+   step]** Either disable paged KV for decode-latency workloads, or read paged blocks **in-kernel via
+   a block table** like vLLM (no `get_rows`/`cpy`). This is a kernel change (a real in-kernel
+   paged-decode read), not a graph change. Caveat: it only brings the paged path back to llama-stock;
+   stock is still ~2x off vLLM, so this is necessary but not sufficient.
+2. **Faster long-context decode-attention kernel. [biggest residual; partly structural]** A proper
+   flash-decoding / split-K-over-KV, GQA-grouped, in-kernel-paged decode kernel for sm_121 (this also
+   subsumes lever 1). Deep CUDA work, gated by kernel maturity on Blackwell-class parts. This is where
+   the context-scaling gap lives and where most of the 2.4x is.
+3. **Fused FP4 weight GEMM. [bounded; ~15-30%]** Fold the activation-quant into the preceding norm/SiLU
+   (vLLM's `norm_quant`/`act_quant`) and into the GEMM epilogue; use native FP4-MMA where the part
+   supports it. Removes the separate `quantize_mmq` pass. Bounded below by weight-read bandwidth
+   (~19 GB/step over 273 GB/s).
+4. **CUDA-graph the steady-state pure-decode step. [smallest, cheapest; ~10-20% of the gap]** Capture
+   the all-128-decoding step once and replay (it is already fixed-shape at steady decode - the
+   scheduler does not need to change to enable this, per `CONTINUOUS_BATCH_SCHEDULER_SCOPE.md` P3).
+   Recovers the ~5% GPU-idle bubble + ggml per-step graph rebuild/realloc + launch latency on the weak
+   Grace cores. A real, independent, low-risk win, but bounded by the ~95%-busy measurement: it does
+   **not** close the kernel gap. Cheaper host-side half-measures that need no graph: persistent device
+   KV/block metadata, build the ggml graph once and reuse it, and remove any per-step host sync (mirror
+   vLLM's persistent-buffer + build-once-reuse + non-blocking-D2H pattern).
+5. **Verify llama's GDN/linear-attention decode path. [architectural, model-specific]** On these
+   Qwen3.6 hybrids vLLM runs the linear-attention layers as an O(1)-in-ctx recurrent state update. If
+   llama.cpp runs those layers of the GGUF model as full attention (O(ctx)) rather than as a recurrent
+   state update, that is a per-layer decode cost vLLM structurally avoids on exactly these models.
+   Check this before attributing the whole residual to the full-attention kernel.
+
+## Honest bottom line
+
+The ~2.4x eager decode gap is **dominantly a kernel-efficiency gap (~80-90%), not a host-overhead
+gap.** The decisive evidence is that llama's GPU is already ~94.6% busy during steady decode, so the
+CUDA-graph-addressable host slice is a minority (~10-20% of the gap), recoverable but bounded. The
+bulk of vLLM's advantage is concrete kernel work: an in-kernel paged-decode read that eliminates
+llama's gather/copy tax (~36% of the paged step), a faster long-context decode-attention kernel, a
+fused native-FP4 GEMM, and (on these specific models) O(1)-in-ctx GDN linear-attention layers. vLLM's
+host loop is cheap by construction (persistent buffers, build-once-reuse metadata, no hot-path sync,
+fixed small launch sequence) and it achieved the 2.4x with *synchronous* scheduling and *no* CUDA
+graphs - so the host is not where vLLM's lead comes from, and a CUDA graph is the cheapest but
+smallest of llama's available levers, not the silver bullet. The throughput effort should be scoped
+as kernel work (in-kernel paged-decode read + flash-decoding attention + fused FP4 GEMM) with a
+CUDA-graphed steady-state decode as a separate, bounded, lower-risk add-on.
+
+## Key source citations (on dgx.casa, read-only)
+
+- Eager driver / host loop: `v1/worker/gpu_model_runner.py` execute_model 4005, _model_forward 3718,
+  _prepare_inputs 1872, _determine_batch_execution_and_padding 3768, sample_tokens 4357,
+  synchronize_input_prep 3704; `v1/worker/block_table.py`; `v1/worker/gpu_input_batch.py:184-205`.
+- Attention: `v1/attention/backends/flash_attn.py` (forward 667-848, varlen call 796-818, builder
+  388-575, update_block_table 577-586); `model_executor/layers/mamba/gdn/qwen_gdn_linear_attn.py`
+  (decode 1644-1696); `model_executor/layers/fla/ops/fused_recurrent.py` (kernel 255-336).
+- GEMM: `model_executor/kernels/linear/nvfp4/flashinfer.py:56-89`;
+  `model_executor/layers/quantization/modelopt.py` (NvFp4 LinearMethod 1103-1232, MoE 1381-1666);
+  `model_executor/layers/fused_moe/experts/marlin_moe.py` (59-225, 227-360, 732-895);
+  `compilation/passes/fusion/rms_quant_fusion.py:98`, `act_quant_fusion.py:40,128`.
+- Sampling: `v1/sample/sampler.py:72-302`; `v1/sample/ops/topk_topp_sampler.py:55,460-497`;
+  `v1/sample/metadata.py`; `v1/engine/output_processor.py`.
+- Config: `config/scheduler.py:146,168-176` (async_scheduling default -> sync Scheduler).
+- Evidence: `~/bench/h2h_dense_vllm.log`, `~/bench/h2h_moe_vllm.log`, `~/bench/decode_study/cat2.py`
+  over `srv_decode2.sqlite`; this worktree `QWEN36_NVFP4_BENCH.md`, `DECODE_GAP_STUDY.md`,
+  `CONTINUOUS_BATCH_SCHEDULER_SCOPE.md`.