The S1 section-(a) padded/fixed-slot decode shape (the scoped follow-up to push serving graph reuse from ~72% toward ~100%) was implemented in an isolated worktree off the committed S1/S3/tail base, built CUDA-only, and benched on GB10. Verdict: REJECTED. It is bit-exact and provably inert, but it regresses serving throughput at every concurrency and does not close the vLLM gap. Implementation (default-off, LLAMA_PAGED_PAD_DECODE): on a pure-decode step (n_prompt_budgeted == 0) emit a masked-inert dummy decode for every idle slot so n_tokens / n_seqs / n_seqs_unq / n_outputs and the seq-id set stay constant; a release()-side guard keeps a finished slot warm under padding. Each dummy is its own sequence (private recurrent state, per-stream paged attention, logits discarded), so it cannot perturb a real stream. Gates: single-seq greedy md5 bit-exact (dense 5951a5b4, paged-MoE 8cb0ce23). The literal per-stream ON-vs-OFF identity gate is unachievable - concurrent cuBLAS/FA decode is not bit-reproducible run-to-run even with padding off (OFF-vs-OFF diverging streams: dense 3/16, MoE 8/16). The achievable inertness gate passed: ON-vs-OFF per-stream prefix-agreement equals the OFF-vs-OFF noise floor exactly (MoE 0.940/0.940, dense 0.812/0.812), so the dummy slots leak nothing. Bench (MoE Qwen3.6-35B-A3B-NVFP4, GB10), burst decode tok/s/seq: n=8 S1+S3 28.16 / PAD 6.05 / vLLM 44.8; n=128 S1+S3 4.53 / PAD 4.32 / vLLM 6.87. Staggered aggregate tok/s: baseline (reuse 0%) 757.6, S1+S3 (reuse 72%) 763.3, PAD (reuse 38%) 558.0. 
Why it fails: (1) serving decode here is GPU-compute-bound, not host-rebuild-bound - baseline reuse 0% ~= S1+S3 reuse 72% on aggregate tok/s, so closing reuse buys ~nothing (the earlier 542->762 host-bound delta did not reproduce); (2) padding adds dummy-row compute proportional to pad_width - real_load, catastrophic at low load; (3) in continuous serving padding cannot hold a constant width (perpetual prefill churn) so reuse drops 72% -> 38%; (4) the completion-driven batch shrink padding prevents is itself a throughput win in a compute-bound regime. The residual burst gap is GPU-compute, which a host-side reuse lever cannot close. Patch series unchanged: this rejected lever is NOT added to patches/paged/. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
DECODE_SERVING_SCOPE - the continuous-serving decode gap
Status: S1 + S3 IMPLEMENTED, GPU-validated, bit-exact, shipped as patches 0040 (S1) + 0041 (S3). S2 DROPPED (measured non-target). See the results block below; the rest of this doc is the design/rationale those patches implement.
Results (GB10, measured)
Phase 0 confirmed host-bound: serving graph reuse 0% over ~5k steps (layer-A
rebuilds every step), hostproc 3.44 ms/step vs 1.59 static - the +1.85 ms IS the
graph rebuild; set_inputs 0.047 ms and block-table 0.002 ms are negligible.
- S1 (patch 0040) - root cause: the paged decode inputs never overrode can_reuse (defaults false), so the graph could never be reused. Fixed with a 256-bucketed-shape can_reuse + live-mctx refresh. Static batched-bench A/B: paged decode reuse 0% -> 95.5%, bit-exact (md5 byte-identical reuse on/off). Necessary but not sufficient in serving (13.8% reuse alone - prefill co-batching churns the shape).
- S3 (patch 0041) - keeps prefill out of decode steps so the scheduler emits reuse-stable pure-decode steps. S1+S3 together (128-client staggered serving, MoE Qwen3.6-35B-A3B-NVFP4): reuse 0% -> 72.2%, hostproc 15.98 -> 6.31 ms/step, decode 4.05 -> 5.52 tok/s/seq median (4.24 -> 5.96 mean, at vLLM's ~5.9).
- S2 (double-buffer set_inputs) - DROPPED. Phase 0 put set_inputs at ~0.05 ms/step: it is not the cost (the rebuild is), so S2 has nothing to recover.
- Follow-up to ~100% reuse - PADDED/FIXED-SLOT DECODE SHAPE: IMPLEMENTED, GPU-TESTED, REJECTED (not shipped). See the "Padded-shape lever - rejected" block below. Summary: it does NOT close the serving gap. Padding holds the pure-decode width constant by emitting masked-inert dummy decodes for idle slots, and it is provably inert (single-seq md5 bit-exact + per-stream noise-floor determinism), but it regresses throughput at every concurrency (catastrophically at low load) because the serving decode here is GPU-compute-bound, not host-rebuild-bound - so the dummy-row compute it adds costs more than the graph-reuse it recovers. The original "remaining ~28% is request-boundary churn -> pad it" hypothesis stands mechanically, but the payoff premise (closing reuse pulls decode toward vLLM) is not supported by measurement.
Padded-shape lever - rejected (implemented + GPU-tested, 2026-06-28)
The S1 section-(a) padded / fixed-slot decode shape was implemented in an
isolated worktree off the committed S1/S3/tail base (paged HEAD 05eceb4), built
CUDA-only, and benched on GB10. Verdict: REJECTED - it regresses serving
throughput and does not close the vLLM gap. Recorded here so it is not re-tried.
Implementation (default-off, LLAMA_PAGED_PAD_DECODE=1; LLAMA_PAGED_PAD_WIDTH
caps the slot range): at the end of pre_decode(), on any step where no prompt
tokens were admitted (n_prompt_budgeted == 0) and there is decode load, emit a
masked-inert dummy decode for every IDLE slot (batch.add(slot.id, 0, pos_max+1, /*output=*/true); cold slot -> fresh pos-0). This holds n_tokens,
n_seqs, n_seqs_unq, n_outputs and the participating seq-id SET constant
across arrivals/completions. A release()-side guard keeps a finished slot warm
under padding (else patch 0024's reclaim-on-idle frees its KV and the next-step
pos-0 re-warm churns paged-block allocation, destroying reuse). Each dummy is its
OWN sequence, so its recurrent (gated-DeltaNet) state is private and its paged
attention reads only its own cells; its logits are computed but never read
(post_decode() only consumes slot.i_batch of GENERATING slots).
Gates. (1) Single-seq greedy md5 bit-exact PASS - dense
5951a5b4d624ce891e22ab5fca9bc439, paged-MoE 8cb0ce23777bf55f92f63d0292c756b0
(the lever lives only in llama-server's update_slots(), never in
llama-completion). (2) Per-stream serving determinism: the literal
"ON-vs-OFF token sequences identical" gate is unachievable - concurrent
cuBLAS/FA decode is not bit-reproducible run-to-run even with padding OFF
(OFF-vs-OFF diverging streams: dense 3/16, MoE 8/16, lockstep K=16). The
achievable inertness gate PASSED: per-stream prefix-agreement ON-vs-OFF equals
the OFF-vs-OFF noise floor exactly (MoE 0.940/0.940, dense 0.812/0.812), i.e. the
dummy slots inject no systematic divergence beyond the pre-existing concurrent FP
noise. So padding is provably inert; it just does not help.
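The inertness gate above compares an ON-vs-OFF agreement score against the OFF-vs-OFF noise floor. A minimal sketch of that comparison, assuming "prefix agreement" means the fraction of a stream's tokens matched before the first divergence, averaged over streams (the exact definition used by the gate is not stated here, and all function names are hypothetical):

```python
from statistics import mean

def prefix_agreement(a, b):
    """Fraction of the longer stream covered by the common prefix of a and b."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common / longest

def mean_prefix_agreement(run_a, run_b):
    """Average per-stream prefix agreement over two runs of the same prompts."""
    return mean(prefix_agreement(a, b) for a, b in zip(run_a, run_b))

def is_inert(on_vs_off, off_vs_off, tol=0.0):
    """Padding counts as inert if it adds no divergence beyond the noise floor."""
    return on_vs_off >= off_vs_off - tol

# Toy token streams (illustrative only, not measured data).
off_1 = [[1, 2, 3, 4], [5, 6, 7, 8]]
off_2 = [[1, 2, 3, 4], [5, 6, 9, 8]]  # second stream diverges at index 2
on_1 = [[1, 2, 3, 4], [5, 6, 0, 8]]   # diverges at the same index

noise_floor = mean_prefix_agreement(off_1, off_2)
on_score = mean_prefix_agreement(on_1, off_1)
```

With the reported scores (MoE 0.940/0.940, dense 0.812/0.812) `is_inert` holds with zero tolerance, which is the sense in which the gate passed.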
Bench (MoE Qwen3.6-35B-A3B-NVFP4, GB10). Burst h2h, decode tok/s/seq:
| n | S1+S3 | PAD | vLLM |
|---|---|---|---|
| 8 | 28.16 | 6.05 | 44.8 |
| 32 | 11.66 | 4.84 | 17.45 |
| 64 | 7.16 | 4.33 | 11.07 |
| 128 | 4.53 | 4.32 | 6.87 |
Staggered (serve_bench.py k=128 n=160 stagger0.25), aggregate decode tok/s and
graph-reuse: baseline (reuse 0%) 757.6; S1+S3 (reuse 72%) 763.3; PAD
(reuse 38%) 558.0.
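The ratios implied by these numbers can be recomputed directly. The sketch below only re-derives figures from the table and the staggered run; the helper names are made up for illustration.

```python
# Burst decode tok/s/seq from the table above: n -> (S1+S3, PAD, vLLM).
burst = {
    8: (28.16, 6.05, 44.8),
    32: (11.66, 4.84, 17.45),
    64: (7.16, 4.33, 11.07),
    128: (4.53, 4.32, 6.87),
}

def pad_slowdown(n):
    """How many times slower PAD is than S1+S3 at burst width n."""
    s1s3, pad, _ = burst[n]
    return s1s3 / pad

def share_of_vllm(n):
    """S1+S3 throughput as a fraction of vLLM's at burst width n."""
    s1s3, _, vllm = burst[n]
    return s1s3 / vllm

# Staggered aggregate decode tok/s.
staggered = {"baseline": 757.6, "S1+S3": 763.3, "PAD": 558.0}

def relative_change(new, old):
    return (new - old) / old

for n in sorted(burst):
    print(f"n={n:3d}  PAD slowdown {pad_slowdown(n):.2f}x  "
          f"S1+S3/vLLM {share_of_vllm(n):.2f}")
print(f"S1+S3 vs baseline: "
      f"{relative_change(staggered['S1+S3'], staggered['baseline']):+.1%}")
print(f"PAD vs S1+S3:      "
      f"{relative_change(staggered['PAD'], staggered['S1+S3']):+.1%}")
```

The PAD slowdown shrinks from roughly 4.65x at n=8 to roughly 1.05x at n=128, while S1+S3 stays at about 63-67% of vLLM at every width. In the staggered run S1+S3 is under 1% above baseline and PAD is about 27% below S1+S3.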
Why it fails (four independent reasons):
- Serving decode is GPU-compute-bound, not host-rebuild-bound (this run). Baseline reuse 0% (757.6 agg) is statistically equal to S1+S3 reuse 72% (763.3 agg): hostproc is only ~4-8% of the per-step wall, so eliminating the host graph rebuild buys ~nothing. (This corrects the host-bound hypothesis above for this hardware: the earlier 542->762 host-bound delta did not reproduce - it was GPU-state/contention variance, not a stable reuse effect.)
- Padding ADDS dummy-row compute (full-width decode), costing throughput in direct proportion to pad_width - real_load: catastrophic at low concurrency (n=8: 28.16 -> 6.05, ~4.6x slower, because 8 real streams pay for a 128-wide step).
- In continuous serving padding can't even hold the width constant: arrivals are perpetually mid-prefill, so the idle-slot count varies and reuse DROPS 72% -> 38% (the opposite of the goal). It only stabilises the pure-decode tail of a burst (verified: width pinned at 64 as real decoders fell 49->5), which is exactly where the dummy compute is most wasteful.
- The completion-driven batch shrink that padding prevents is itself a throughput WIN in a compute-bound regime (fewer real streams -> cheaper steps -> survivors finish faster); forcing constant width forfeits it.
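Reason 2 can be made concrete by counting dummy rows. A minimal sketch, assuming step cost scales with row count in the compute-bound regime (a simplification, not a measured cost model); `dummy_fraction` is a hypothetical helper:

```python
def dummy_fraction(pad_width, real_load):
    """Share of rows in a padded decode step that are dummy (non-real) rows."""
    if not 0 < real_load <= pad_width:
        raise ValueError("need 0 < real_load <= pad_width")
    return (pad_width - real_load) / pad_width

# Cases quoted above: 8 real streams in a 128-wide step, and a burst tail
# pinned at width 64 while real decoders fall from 49 to 5.
cases = [(128, 8), (64, 49), (64, 5)]
for width, load in cases:
    print(f"width={width:3d} real={load:3d} "
          f"dummy={dummy_fraction(width, load):.1%}")
```

At 8 real streams in a 128-wide step nearly 94% of the rows are dummies, and the burst tail drifts from about 23% to about 92% dummy rows as real decoders drain.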
Conclusion. The residual burst gap (paged 4.53 vs vLLM 6.87 at n=128 ~= 66%)
is a GPU-compute gap (vLLM's MoE decode kernel + scheduler are ~1.3x faster on
aggregate), not a host-loop gap. A host-side graph-reuse lever cannot close it.
Do not re-pursue padded/fixed-slot shapes for throughput; if the host loop is ever
re-confirmed dominant on other hardware (re-run reason 1's baseline-vs-S1+S3 A/B
first), revisit - but only with an adaptive width matched to live load, never a
fixed pad-to---parallel.
Per the "profile-don't-assume" rule in .agents/vllm-parity-methodology.md, Phase 0 (section 5) is to confirm the bottleneck on GPU before touching any code. Everything below the Phase-0 line is a hypothesis ranked by value/effort/risk, not a measured result.
Regime warning (read first). Every "decode is at the BW floor / ties vLLM" and "host scheduling loop is the structural residual" conclusion in README.md section 5 was measured with llama-batched-bench: a STATIC serving width (fixed npl, all sequences in lockstep, constant batch shape every step). That is the decode KERNEL regime, and there the patch series is at parity (paged ~6.1 tok/s/seq vs vLLM ~5.9 at npl128). This document is about a different regime: real continuous SERVING through llama-server's update_slots() loop, where requests arrive and complete asynchronously, the batch shape churns every step, and paged drops to ~3.7 tok/s/seq (-39%) while vLLM sustains ~5.9. The gap is the scheduler / host loop, not the kernel. This is the serving analogue of the prefill-GEMM regime split called out in PREFILL_GEMM_SCOPE.md.
Cross-links: README.md sections 2 (scheduler), 3 (patches
0008/0013/0016/0024/0025/0029), 5 (rejected levers - lever 2 graph coverage was
FLAT in the static regime; this doc reopens it for the serving regime);
.agents/llama-cpp-localai-paged-backend.md
(bit-exact gate);
.agents/vllm-parity-methodology.md
(both-engine ground-truth, per-lever A/B, record-rejected-levers).
1. The two regimes, and why the kernel-parity result does not carry over
llama-batched-bench and a real serving workload exercise the same decode
kernels but different host loops:
| | llama-batched-bench (kernel regime) | llama-server continuous serving |
|---|---|---|
| batch shape per step | constant (fixed npl, lockstep) | churns (arrivals/completions, interleaved prefill) |
| participating seq-set | fixed for the whole run | changes as requests start/finish |
| graph reuse (see s.2) | holds after warmup -> 1 capture, replayed | breaks nearly every step -> rebuild + re-capture |
| measured | paged ~6.1 tok/s/seq ~ vLLM ~5.9 | paged ~3.7 vs vLLM ~5.9 (-39%) |
The README's decode parity, BW-floor, and "host loop is the irreducible residual" findings are all kernel-regime findings. They prove the kernels are not the serving gap. They do not prove the host loop is irreducible in serving - the static bench holds the batch shape constant, which is exactly the condition that lets both graph-reuse layers (section 2) stay hot. Serving violates that condition. So the serving gap is reopened here as a host / scheduler problem, orthogonal to the kernel.
2. Root-cause hypothesis (from source, pin 9d5d882d + the dev tree)
There are two independent graph-reuse layers, and continuous batching breaks both on nearly every step. This is the leading hypothesis for the -39%.
2a. Layer A - llama-context graph reuse (can_reuse / allow_reuse)
llama_context::process_ubatch (src/llama-context.cpp ~L1366) only reuses
the built ggml graph when res->can_reuse(gparams) holds. allow_reuse
(src/llama-graph.h ~L631) requires, among others:
ubatch.n_tokens == other.ubatch.n_tokens &&
ubatch.n_seqs == other.ubatch.n_seqs &&
ubatch.n_seqs_unq == other.ubatch.n_seqs_unq &&
ubatch.equal_seqs() == other.ubatch.equal_seqs()
// + (when equal_seqs) the participating sequence-id SET must match
In serving, n_tokens changes whenever the decode load D changes or a prefill
chunk is co-batched, and the sequence-id set changes whenever a request
starts or finishes. Either makes can_reuse return false, so process_ubatch
falls into the else branch: rebuild the graph (model.build_graph) +
ggml_backend_sched_reset + ggml_backend_sched_alloc_graph - full host-side
graph construction + allocation, every step. In batched-bench all sequences
are lockstep so n_tokens/seq-set are constant and can_reuse is true after
warmup (the graphs reused = N perf line is ~all steps).
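The effect of these conditions on a churning batch can be seen in a toy model. This is a simplified sketch of the listed shape checks only, not llama.cpp code; `UbatchShape`, `can_reuse` and `decode_step` are hypothetical names, and the one-token-per-sequence decode step is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UbatchShape:
    n_tokens: int
    n_seqs: int
    n_seqs_unq: int
    equal_seqs: bool
    seq_ids: frozenset

def can_reuse(prev, cur):
    """Simplified model of the shape conditions listed above."""
    same_shape = (
        prev.n_tokens == cur.n_tokens
        and prev.n_seqs == cur.n_seqs
        and prev.n_seqs_unq == cur.n_seqs_unq
        and prev.equal_seqs == cur.equal_seqs
    )
    if not same_shape:
        return False
    # When equal_seqs holds, the participating sequence-id set must match too.
    return (not cur.equal_seqs) or prev.seq_ids == cur.seq_ids

def decode_step(active_seq_ids):
    """Assumption: a pure-decode step carries one token per active sequence."""
    ids = frozenset(active_seq_ids)
    return UbatchShape(len(ids), len(ids), len(ids), True, ids)

# Lockstep (batched-bench style): the same four sequences every step.
lockstep = [decode_step(range(4)) for _ in range(3)]
# Serving style: one request finishes, then a new one takes its place.
serving = [decode_step({0, 1, 2, 3}), decode_step({0, 1, 2}), decode_step({0, 1, 4})]

lockstep_reuse = [can_reuse(a, b) for a, b in zip(lockstep, lockstep[1:])]
serving_reuse = [can_reuse(a, b) for a, b in zip(serving, serving[1:])]
```

The second serving transition keeps n_tokens at 3 and still fails, because the sequence-id set changed; that is the component that bucketing n_tokens alone does not address.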
2b. Layer B - CUDA graph capture (ggml_cuda_graph_*)
Even when layer A reuses, the CUDA backend re-checks
ggml_cuda_graph_update_required (ggml-cuda.cu ~L3367): it memcmps every
node's ne, nb, and src[]->data pointers against the captured graph. Any
shape change -> cudaGraphExecUpdate / re-instantiate. Two serving-specific
triggers:
- shape churn (same root cause as layer A): different n_tokens -> different node ne -> update required.
- paged data-pointer churn: when a co-batched prefill allocates new KV blocks (or a finished sequence frees them), the per-step KV view tensors' data pointers move, so even a constant-shape decode step can trip the memcmp. (The block-table contents live in a fixed device buffer filled by set_inputs, so the table tensor pointer itself is stable - 0029 keeps that cheap - but the K/V cache views are not.)
Net: under serving, the GPU sits idle between launches while the host rebuilds
the graph (layer A) and re-instantiates the CUDA graph (layer B), then runs an
un-graphed set_inputs (H2D input copies) before each launch. vLLM avoids this
with padded/bucketed decode batch shapes + piecewise CUDA graphs: it pads the
decode batch to a fixed set of sizes and captures one persistent graph per
bucket, so the steady-state decode step is a single cudaGraphLaunch with no
host rebuild. Its scheduler is also a tight C++ loop with chunked-prefill
interleave that keeps the GPU fed.
2c. Per-step host work that runs un-graphed regardless (already instrumented)
The dev tree carries a built-in [L5INSTR] profiler (src/paged-attn.cpp,
hooks in src/llama-context.cpp and src/llama-kv-cache.cpp) that already
isolates the host buckets we care about, printed at process exit:
[L5INSTR] get_block_table n=.. sum=..ms mean=..ms | set_inputs n=.. mean=..ms | hostproc n=.. mean=..ms
- hostproc = mctx->apply() + graph reuse-check/rebuild + set_inputs, i.e. the whole host window before graph_compute (it does NOT include the GPU launch). Prior profiles put this near ~1.4 ms/step.
- set_inputs = the H2D input fills (positions, masks, block table, idxs).
- get_block_table = the paged block-table host build (0029 caches it within-step; LLAMA_PAGED_NO_BT_CACHE A/B-toggles that).
If hostproc per step is a large fraction of the serving per-step wall time
(and the graphs reused count is low), the gap is host-bound, not kernel-bound.
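A rough way to size that fraction from numbers already in the results block: if every stream advances one token per step, the per-step wall time is about 1000 / (tok/s/seq) ms. That one-token-per-step assumption ignores prefill-only steps, so treat the sketch below as an order-of-magnitude estimate; the helper names are hypothetical.

```python
def step_wall_ms(tok_per_s_per_seq):
    """Approximate per-step wall time, assuming one token per stream per step."""
    return 1000.0 / tok_per_s_per_seq

def host_share(hostproc_ms, tok_per_s_per_seq):
    """Fraction of the estimated per-step wall time spent in hostproc."""
    return hostproc_ms / step_wall_ms(tok_per_s_per_seq)

# Figures from the results block: hostproc ms/step and median decode tok/s/seq
# for the 128-client staggered MoE run, before and after S1+S3.
before = host_share(15.98, 4.05)
after = host_share(6.31, 5.52)
print(f"before S1+S3: {before:.1%} of step wall")
print(f"after  S1+S3: {after:.1%} of step wall")
```

Under that assumption hostproc comes out in the low single-digit percent range of the step both before and after S1+S3, the same order as the ~4-8% figure quoted in the rejected-lever block.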
2d. The serial-SSM host loop (named in README s.5, secondary here)
The gated-DeltaNet decode advances recurrent state per step; sampling cannot start until logits land. The README already names this as a structural floor in the kernel regime. It is the same in serving but is the smaller term - the graph-rebuild/re-capture overhead (2a/2b) is the new, serving-specific cost the static bench hides, and it is the one to attack first.
3. What the already-shipped scheduler patches do (and do NOT do)
These exist; understand them before proposing anything. None of them touch the two graph-reuse layers - they target prefill freezing and burst collapse, not steady-state decode-step host overhead. That is why the serving gap survives them.
| Patch | What it does | What it does NOT do |
|---|---|---|
| 0008 cross-request prefix-share (server loop) | Concurrent shared-prefix requests prefill only the divergent suffix (fewer prefill tokens). | Does not stabilise decode batch shape; does not graph-reuse. |
| 0013 LLAMA_PREFILL_BUDGET | Static per-step prefill-token cap (vLLM --max-num-batched-tokens analogue); flattens the ITL spike a long prefill inflicts on co-batched decode. | Ignores decode load; per-workload tuning; no effect on decode-step graph reuse. |
| 0016 dynamic decode-first budget | max(n_ubatch, T-D) leftover-after-decode budget + per-slot chunk cap; decode claimed first, auto-shrinks as D rises. Stops a prefill chunk from inflating the step past T. | Still lets the per-step decode n_tokens and seq-set vary, so it does not make the decode step graph-reusable; it shapes prefill admission, not decode-shape stability. |
| 0024 paged-pool burst-reclaim | Truncate/defrag/release KV blocks; fixes long-server prefill burst collapse (488->44->532 t/s). | Host accounting only; nothing about decode-step graph capture. |
| 0025 LLAMA_MOE_FORCE_GRAPHS | Keeps CUDA graphs ON for the grouped-MMQ MoE decode step (lifts the conservative MUL_MAT_ID graph-disable). | Helps the CUDA-graph eligibility of one op; does not make layer-A/B reuse hold across churning steps. It is necessary-not-sufficient: a step that rebuilds anyway gets recaptured regardless. |
| 0029 block-table within-step cache | get_block_table computed once per step, memcpy'd to other full-attn layers (-87/-91%). | Shrinks one set_inputs/hostproc sub-term; does not address rebuild/re-capture. |
README s.5 "lever 2 (graph/stream coverage): FLAT" was concluded in the static batched-bench regime, where graphs already reuse - so more graph coverage was correctly a no-op there. That conclusion does not apply to the serving regime, where graphs do not reuse. This doc reopens graph coverage for serving only; record it as a regime-scoped reopening, not a contradiction.
4. Ranked lever plan (hypotheses - gate on Phase 0 first)
Ranked by value/effort with bit-exactness/risk called out. All are host-side / scheduler levers (no decode-kernel changes), so all are bit-exact-safe by construction provided padding tokens are masked-inert and verified against the per-path md5 gate.
Lever S1 (TOP) - bucketed/padded decode-step shape for graph reuse
Value: high (targets the dominant -39% mechanism). Effort: medium-high. Risk: medium (correctness of padding inertness; seq-set churn is harder than n_tokens).
Make the steady-state decode step present a stable, bucketed shape to both reuse layers, mirroring vLLM's padded decode batch + piecewise CUDA graphs:
- Pad the per-step decode n_tokens (and the stream/seq count the graph sees) up to the next bucket in a small fixed set (e.g. {power-of-two or fixed grid}), so allow_reuse (layer A) and update_required (layer B) hold across steps with the same bucket. Padding tokens are dummy, masked positions that contribute nothing to any real sequence's logits.
- Bound the number of distinct live buckets so a handful of persistent CUDA graphs cover steady decode (vLLM captures ~tens).
- Handle the seq-set component of allow_reuse: bucketing n_tokens alone is insufficient because the participating sequence-id set must also match. Either (a) pad to a fixed stream-slot layout so the seq-set is stable across arrivals/completions, or (b) relax/extend the reuse key so a pure-decode step keyed on bucket+slot-layout reuses regardless of which slots are occupied. (b) is the higher-leverage but more invasive option.
Bit-exact gate: greedy md5 per path with padding ON must equal the recorded
references (5951a5b4 dense, 8cb0ce23 paged-MoE); test-backend-ops
unaffected (no op changes). The risk is that masked/padded positions leak into a
real logit (off-by-one in the mask) - the md5 gate catches it.
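The bucketing step of S1 can be sketched as a rounding rule. This assumes the power-of-two option from the list above and a hypothetical 128-slot cap; it is an illustration of the idea, not the shipped 256-bucketed can_reuse from patch 0040.

```python
def bucket_for(n_tokens, max_width):
    """Round a decode width up to the next power of two, capped at max_width."""
    if not 0 < n_tokens <= max_width:
        raise ValueError("need 0 < n_tokens <= max_width")
    bucket = 1
    while bucket < n_tokens:
        bucket *= 2
    return min(bucket, max_width)

def padding_rows(n_tokens, max_width):
    """Dummy rows needed to bring a step up to its bucket."""
    return bucket_for(n_tokens, max_width) - n_tokens

MAX_WIDTH = 128  # hypothetical slot count, for illustration only

# Every live width from 1..MAX_WIDTH maps onto a small, fixed set of shapes.
buckets = sorted({bucket_for(n, MAX_WIDTH) for n in range(1, MAX_WIDTH + 1)})
print(buckets)
print(padding_rows(49, MAX_WIDTH), padding_rows(5, MAX_WIDTH))
```

Eight shapes cover every width up to 128, and a load of 5 pads to 8 rather than to the full slot count, which is the difference between a load-matched bucket and a fixed pad-to-width layout.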
Lever S2 - overlap per-step host work with GPU decode (double-buffer inputs)
Value: medium-high (recovers the hostproc window even when S1 partial).
Effort: medium. Risk: low (host-side reordering only, bit-exact-safe).
Even with graphs reused, set_inputs (+ the pre-set_inputs sync) runs
un-graphed and serially before each launch (hostproc ~1.4 ms/step in prior
profiles). Overlap the host scheduling + input build of step N+1 with the GPU
decode of step N: double-buffer the input device tensors so the host can fill
N+1's inputs while N's graph is in flight, and prepare the next ubatch / block
table on the host concurrently. This is the llama.cpp analogue of vLLM keeping
the GPU fed. Strictly host-side, no numeric change -> bit-exact. (0029 already
banks part of this for the block table within a step; S2 extends it across
steps.)
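The upper bound on what S2 can recover follows from a two-term timing model: without overlap the host and GPU costs add, with ideal overlap the step costs the larger of the two. A minimal sketch under that idealisation; the GPU step time below is a made-up placeholder, while the host figures are the ones quoted in this document.

```python
def serial_step_ms(host_ms, gpu_ms):
    """Host input build runs before each launch, so the costs add."""
    return host_ms + gpu_ms

def overlapped_step_ms(host_ms, gpu_ms):
    """Ideal double-buffering: step N+1's host work hides under step N's GPU work."""
    return max(host_ms, gpu_ms)

def max_saving_ms(host_ms, gpu_ms):
    """Best-case per-step saving from overlap; equals min(host_ms, gpu_ms)."""
    return serial_step_ms(host_ms, gpu_ms) - overlapped_step_ms(host_ms, gpu_ms)

GPU_MS = 150.0  # hypothetical GPU decode time per step, for illustration only
for label, host_ms in [("hostproc (prior profile)", 1.4),
                       ("set_inputs (Phase 0)", 0.047)]:
    print(f"{label}: at most {max_saving_ms(host_ms, GPU_MS):.3f} ms/step recoverable")
```

Whenever the GPU term dominates, the saving is capped by the host term itself, so overlapping a ~0.05 ms set_inputs cannot return more than ~0.05 ms per step.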
Lever S3 - graph-shape-stable scheduling (bridge from 0016)
Value: medium (multiplies S1; low marginal value without S1). Effort: low-medium (extends the existing 0016 policy). Risk: low (scheduler policy, bit-exact when the decode result is unchanged).
Extend the existing decode-first budget (0016) so the scheduler actively prefers graph-reusable steps: keep prefill chunks out of the decode step (run prefill in its own steps, or at a fixed chunk size) so the decode batch shape stays on a bucket rather than being perturbed by interleaved prefill tokens every step. This is the policy half of S1 - S1 makes a bucketed step reusable; S3 makes the scheduler emit bucketed steps. Pair them.
Rejected/deferred (record so they are not re-tried):
- More CUDA-graph coverage alone (the README lever-2 redo): still FLAT without S1. Forcing more ops graph-eligible (beyond 0025) does nothing while layer A rebuilds the graph every step - the recapture dominates. Only valuable after S1 makes reuse hold.
- GGML_CUDA_DISABLE_GRAPHS / disabling graphs in serving: REJECTED a priori as a fix (it is an A/B probe for Phase 0, not a lever) - it removes capture cost but also removes replay benefit; expected net-negative.
- Precision levers (W4A16, bf16-SSM): out of scope - this gap is host-bound, not GEMM/BW-bound (see README s.5 rejections; do not reopen).
5. Phase 0 - confirm it is host-bound BEFORE building (run when the GPU frees)
Do NOT build any lever until this confirms host-bound. The dev tree already has all the instrumentation; this is a measurement, not a code change. One GPU bencher at a time (GPU-contention rule).
Workload. Real continuous serving, not batched-bench: run llama-server
(paged build) with the paged config and drive it with a steady concurrent
streaming load (e.g. a K-client async generator hitting /completion with
staggered arrivals so requests start/finish asynchronously - the regime
batched-bench cannot produce). Use the same models/flags as README s.4:
-fa on -ngl 99, LLAMA_KV_PAGED=1 (+ LLAMA_MOE_FORCE_GRAPHS=1 for MoE),
dense Qwen3.6-27B-NVFP4 and MoE Qwen3.6-35B-A3B-NVFP4. Pick K so the effective
decode width matches a static npl you have a kernel-regime number for (e.g.
~128) - that gives the apples-to-apples comparison: static 6.1 vs serving 3.7 tok/s/seq.
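The staggered-arrival driver described above can be sketched with asyncio. The request function is injected so the scheduling logic stands alone; `run_staggered` and `fake_send` are hypothetical names, and a real `send` would issue a streaming POST to the server's /completion endpoint (not shown, since the client details are not specified here).

```python
import asyncio

async def run_staggered(send, n_requests, stagger_s, max_in_flight):
    """Start request i at i * stagger_s, with at most max_in_flight running."""
    gate = asyncio.Semaphore(max_in_flight)

    async def one(i):
        await asyncio.sleep(i * stagger_s)
        async with gate:
            return await send(i)

    # gather preserves request order in its result list.
    return await asyncio.gather(*(one(i) for i in range(n_requests)))

async def fake_send(i):
    """Stand-in for a streaming request; returns a fake generated-token count."""
    await asyncio.sleep(0.001 * (i % 3))  # uneven durations -> async completions
    return 10 + i

if __name__ == "__main__":
    counts = asyncio.run(
        run_staggered(fake_send, n_requests=8, stagger_s=0.002, max_in_flight=4)
    )
    print(counts)
```

Reading the serve_bench.py settings quoted earlier (k=128 n=160 stagger0.25) as max_in_flight=128, n_requests=160, stagger_s=0.25 is an interpretation, not something confirmed by the script itself.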
Signals to capture (all already exist):
- Graph reuse rate. The graphs reused = N perf line (llama-context.cpp ~L4146, from data.n_reused) over total decode steps. Hypothesis: ~100% in batched-bench, near 0% in serving. This is the single most decisive number. A/B with LLAMA_GRAPH_REUSE_DISABLE=1 (forces the rebuild path) - if serving is already near that floor, layer-A reuse is the gap.
- [L5INSTR] host buckets (printed at exit): hostproc, set_inputs, get_block_table mean ms/step. Compare serving vs batched-bench. A/B the block-table cache with LLAMA_PAGED_NO_BT_CACHE.
- GPU-busy % in a steady-state serving window via nsys (sum of kernel durations / wall) and the inter-launch host gap (time between consecutive cudaGraphLaunch/kernel launches). Hypothesis: batched-bench ~96-99% busy (README/methodology note the early "low util" was a window artifact); serving materially lower, with the gap ~= hostproc/step. Watch the same window artifact the methodology warns about - measure a clean steady-state span.
- CUDA-graph re-instantiation count - confirm layer B is also re-capturing (nsys shows cudaGraphInstantiate / cudaGraphExecUpdate per step, or add a host-side counter print - host-side only, no kernel code).
Decision rule. Host-bound (proceed with S1/S2/S3) if: serving graphs reused
is low AND hostproc/step is a large fraction of serving per-step wall AND
GPU-busy% drops vs batched-bench by ~the observed throughput ratio (~3.7/6.1).
If instead GPU-busy% stays high and per-kernel time grows, the cause is
elsewhere (e.g. serving runs a worse effective batch shape into the kernels) -
re-scope before building.
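The decision rule can be written down as a predicate over the three signals. The text gives no numeric cut-offs for "low" or "a large fraction", so every threshold below is a hypothetical default chosen only to make the sketch executable, and the example readings are invented.

```python
def graph_reuse_rate(n_reused, n_decode_steps):
    """Share of decode steps that reused the built graph."""
    return n_reused / n_decode_steps if n_decode_steps else 0.0

def is_host_bound(reuse_rate, host_share, busy_serving, busy_static,
                  tput_ratio, max_reuse=0.2, min_host_share=0.25, busy_tol=0.1):
    """All three Phase 0 signals must agree; thresholds are hypothetical."""
    reuse_low = reuse_rate <= max_reuse
    host_large = host_share >= min_host_share
    # GPU-busy should fall by roughly the observed throughput ratio.
    busy_tracks_tput = abs(busy_serving / busy_static - tput_ratio) <= busy_tol
    return reuse_low and host_large and busy_tracks_tput

TPUT_RATIO = 3.7 / 6.1  # serving vs static tok/s/seq from the text

# Invented readings for two outcomes (not measurements).
host_bound_case = is_host_bound(0.02, 0.40, 0.59, 0.97, TPUT_RATIO)
compute_bound_case = is_host_bound(0.00, 0.06, 0.95, 0.97, TPUT_RATIO)
```

The second case shows why a low reuse rate alone is not decisive: with a small host share and GPU-busy still high, the rule says to re-scope rather than build.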
Ground-truth vLLM (both-engine rule). Capture vLLM at the same concurrency: GPU-busy% / step cadence (nsys) and its scheduler step time. Confirm vLLM stays GPU-bound (persistent graphs) where paged goes host-bound - that is the direct evidence the gap is the host loop, and it sizes the achievable win.
6. Summary
- The serving gap (paged 3.7 vs vLLM 5.9 tok/s/seq, -39%) is a host/scheduler problem, distinct from the decode kernel (at parity in batched-bench). The README's BW-floor/host-loop-residual findings are kernel-regime and do not bound the serving regime.
- Leading mechanism: continuous batching's batch-shape + seq-set churn breaks both graph-reuse layers (llama-context can_reuse, CUDA update_required) every step, so the GPU idles while the host rebuilds + re-captures + runs un-graphed set_inputs. vLLM avoids this with padded/bucketed decode shapes + piecewise CUDA graphs.
- The shipped scheduler patches (0008/0013/0016/0024/0025/0029) target prefill freezing + burst collapse, not decode-step graph reuse - which is why the serving gap survives them.
- Top levers (all host-side, bit-exact-safe): S1 bucketed/padded decode-step shape for graph reuse, S2 double-buffer/overlap per-step host work, S3 graph-shape-stable scheduling (extend 0016). Gate everything on Phase 0: the graphs reused rate + [L5INSTR] host buckets + nsys GPU-busy% in real llama-server serving vs batched-bench, with vLLM ground-truthed at the same concurrency.