mirror of https://github.com/mudler/LocalAI.git synced 2026-06-30 11:26:32 -04:00

Files

Ettore Di Giacinto b028c81eda docs(paged): record padded/fixed-slot decode shape as tested-and-rejected

The S1 section-(a) padded/fixed-slot decode shape (the scoped follow-up to push
serving graph reuse from ~72% toward ~100%) was implemented in an isolated
worktree off the committed S1/S3/tail base, built CUDA-only, and benched on GB10.
Verdict: REJECTED. It is bit-exact and provably inert, but it regresses serving
throughput at every concurrency and does not close the vLLM gap.

Implementation (default-off, LLAMA_PAGED_PAD_DECODE): on a pure-decode step
(n_prompt_budgeted == 0) emit a masked-inert dummy decode for every idle slot so
n_tokens / n_seqs / n_seqs_unq / n_outputs and the seq-id set stay constant; a
release()-side guard keeps a finished slot warm under padding. Each dummy is its
own sequence (private recurrent state, per-stream paged attention, logits
discarded), so it cannot perturb a real stream.

Gates: single-seq greedy md5 bit-exact (dense 5951a5b4, paged-MoE 8cb0ce23). The
literal per-stream ON-vs-OFF identity gate is unachievable - concurrent cuBLAS/FA
decode is not bit-reproducible run-to-run even with padding off (OFF-vs-OFF
diverging streams: dense 3/16, MoE 8/16). The achievable inertness gate passed:
ON-vs-OFF per-stream prefix-agreement equals the OFF-vs-OFF noise floor exactly
(MoE 0.940/0.940, dense 0.812/0.812), so the dummy slots leak nothing.

Bench (MoE Qwen3.6-35B-A3B-NVFP4, GB10), burst decode tok/s/seq: n=8 S1+S3 28.16
/ PAD 6.05 / vLLM 44.8; n=128 S1+S3 4.53 / PAD 4.32 / vLLM 6.87. Staggered
aggregate tok/s: baseline (reuse 0%) 757.6, S1+S3 (reuse 72%) 763.3, PAD
(reuse 38%) 558.0.

Why it fails: (1) serving decode here is GPU-compute-bound, not host-rebuild-bound
- baseline reuse 0% ~= S1+S3 reuse 72% on aggregate tok/s, so closing reuse buys
~nothing (the earlier 542->762 host-bound delta did not reproduce); (2) padding
adds dummy-row compute proportional to pad_width - real_load, catastrophic at low
load; (3) in continuous serving padding cannot hold a constant width (perpetual
prefill churn) so reuse drops 72% -> 38%; (4) the completion-driven batch shrink
padding prevents is itself a throughput win in a compute-bound regime. The
residual burst gap is GPU-compute, which a host-side reuse lever cannot close.

Patch series unchanged: this rejected lever is NOT added to patches/paged/.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-28 20:47:43 +00:00

24 KiB

Raw Blame History

DECODE_SERVING_SCOPE - the continuous-serving decode gap

Status: S1 + S3 IMPLEMENTED, GPU-validated, bit-exact, shipped as patches 0040 (S1) + 0041 (S3). S2 DROPPED (measured non-target). See the results block below; the rest of this doc is the design/rationale those patches implement.

Results (GB10, measured)

Phase 0 confirmed host-bound: serving graph reuse 0% over ~5k steps (layer-A rebuilds every step), hostproc 3.44 ms/step vs 1.59 static - the +1.85 ms IS the graph rebuild; set_inputs 0.047 ms and block-table 0.002 ms are negligible.

S1 (patch 0040) - root cause: the paged decode inputs never overrode can_reuse (defaults false), so the graph could never be reused. Fixed with a 256-bucketed-shape can_reuse + live-mctx refresh. Static batched-bench A/B: paged decode reuse 0% -> 95.5%, bit-exact (md5 byte-identical reuse on/off). Necessary but not sufficient in serving (13.8% reuse alone - prefill co-batching churns the shape).
S3 (patch 0041) - keeps prefill out of decode steps so the scheduler emits reuse-stable pure-decode steps. S1+S3 together (128-client staggered serving, MoE Qwen3.6-35B-A3B-NVFP4): reuse 0% -> 72.2%, hostproc 15.98 -> 6.31 ms/step, decode 4.05 -> 5.52 tok/s/seq median (4.24 -> 5.96 mean, at vLLM's ~5.9).
S2 (double-buffer set_inputs) - DROPPED. Phase 0 put set_inputs at ~0.05 ms/step: it is not the cost (the rebuild is), so S2 has nothing to recover.
Follow-up to ~100% reuse - PADDED/FIXED-SLOT DECODE SHAPE: IMPLEMENTED, GPU-TESTED, REJECTED (not shipped). See the "Padded-shape lever - rejected" block below. Summary: it does NOT close the serving gap. Padding holds the pure-decode width constant by emitting masked-inert dummy decodes for idle slots, and it is provably inert (single-seq md5 bit-exact + per-stream noise-floor determinism), but it regresses throughput at every concurrency (catastrophically at low load) because the serving decode here is GPU-compute-bound, not host-rebuild-bound - so the dummy-row compute it adds costs more than the graph-reuse it recovers. The original "remaining ~28% is request-boundary churn -> pad it" hypothesis stands mechanically, but the payoff premise (closing reuse pulls decode toward vLLM) is not supported by measurement.

Padded-shape lever - rejected (implemented + GPU-tested, 2026-06-28)

The S1 section-(a) padded / fixed-slot decode shape was implemented in an isolated worktree off the committed S1/S3/tail base (paged HEAD 05eceb4), built CUDA-only, and benched on GB10. Verdict: REJECTED - it regresses serving throughput and does not close the vLLM gap. Recorded here so it is not re-tried.

Implementation (default-off, LLAMA_PAGED_PAD_DECODE=1; LLAMA_PAGED_PAD_WIDTH caps the slot range): at the end of pre_decode(), on any step where no prompt tokens were admitted (n_prompt_budgeted == 0) and there is decode load, emit a masked-inert dummy decode for every IDLE slot (batch.add(slot.id, 0, pos_max+1, /*output=*/true); cold slot -> fresh pos-0). This holds n_tokens, n_seqs, n_seqs_unq, n_outputs and the participating seq-id SET constant across arrivals/completions. A release()-side guard keeps a finished slot warm under padding (else patch 0024's reclaim-on-idle frees its KV and the next-step pos-0 re-warm churns paged-block allocation, destroying reuse). Each dummy is its OWN sequence, so its recurrent (gated-DeltaNet) state is private and its paged attention reads only its own cells; its logits are computed but never read (post_decode() only consumes slot.i_batch of GENERATING slots).

Gates. (1) Single-seq greedy md5 bit-exact PASS - dense 5951a5b4d624ce891e22ab5fca9bc439, paged-MoE 8cb0ce23777bf55f92f63d0292c756b0 (the lever lives only in llama-server's update_slots(), never in llama-completion). (2) Per-stream serving determinism: the literal "ON-vs-OFF token sequences identical" gate is unachievable - concurrent cuBLAS/FA decode is not bit-reproducible run-to-run even with padding OFF (OFF-vs-OFF diverging streams: dense 3/16, MoE 8/16, lockstep K=16). The achievable inertness gate PASSED: per-stream prefix-agreement ON-vs-OFF equals the OFF-vs-OFF noise floor exactly (MoE 0.940/0.940, dense 0.812/0.812), i.e. the dummy slots inject no systematic divergence beyond the pre-existing concurrent FP noise. So padding is provably inert; it just does not help.

Bench (MoE Qwen3.6-35B-A3B-NVFP4, GB10). Burst h2h, decode tok/s/seq:

n	S1+S3	PAD	vLLM
8	28.16	6.05	44.8
32	11.66	4.84	17.45
64	7.16	4.33	11.07
128	4.53	4.32	6.87

Staggered (serve_bench.py k=128 n=160 stagger0.25), aggregate decode tok/s and graph-reuse: baseline (reuse 0%) 757.6; S1+S3 (reuse 72%) 763.3; PAD (reuse 38%) 558.0.

Why it fails (four independent reasons):

Serving decode is GPU-compute-bound, not host-rebuild-bound (this run). Baseline reuse 0% (757.6 agg) is statistically equal to S1+S3 reuse 72% (763.3 agg): hostproc is only ~4-8% of the per-step wall, so eliminating the host graph rebuild buys ~nothing. (This corrects the host-bound hypothesis above for this hardware: the earlier 542->762 host-bound delta did not reproduce
- it was GPU-state/contention variance, not a stable reuse effect.)
Padding ADDS dummy-row compute (full-width decode), costing throughput in direct proportion to pad_width - real_load: catastrophic at low concurrency (n=8: 28.16 -> 6.05, ~4.6x slower, because 8 real streams pay for a 128-wide step).
In continuous serving padding can't even hold the width constant: arrivals are perpetually mid-prefill, so the idle-slot count varies and reuse DROPS 72% -> 38% (the opposite of the goal). It only stabilises the pure-decode tail of a burst (verified: width pinned at 64 as real decoders fell 49->5), which is exactly where the dummy compute is most wasteful.
The completion-driven batch shrink that padding prevents is itself a throughput WIN in a compute-bound regime (fewer real streams -> cheaper steps -> survivors finish faster); forcing constant width forfeits it.

Conclusion. The residual burst gap (paged 4.53 vs vLLM 6.87 at n=128 ~= 66%) is a GPU-compute gap (vLLM's MoE decode kernel + scheduler are ~1.3x faster on aggregate), not a host-loop gap. A host-side graph-reuse lever cannot close it. Do not re-pursue padded/fixed-slot shapes for throughput; if the host loop is ever re-confirmed dominant on other hardware (re-run reason 1's baseline-vs-S1+S3 A/B first), revisit - but only with an adaptive width matched to live load, never a fixed pad-to---parallel.

Per the "profile-don't-assume" rule in .agents/vllm-parity-methodology.md, Phase 0 (section 5) is to confirm the bottleneck on GPU before touching any code. Everything below the Phase-0 line is a hypothesis ranked by value/effort/risk, not a measured result.

Regime warning (read first). Every "decode is at the BW floor / ties vLLM" and "host scheduling loop is the structural residual" conclusion in README.md section 5 was measured with llama-batched-bench: a STATIC serving width (fixed npl, all sequences in lockstep, constant batch shape every step). That is the decode KERNEL regime, and there the patch series is at parity (paged ~6.1 tok/s/seq vs vLLM ~5.9 at npl128). This document is about a different regime: real continuous SERVING through llama-server's update_slots() loop, where requests arrive and complete asynchronously, the batch shape churns every step, and paged drops to ~3.7 tok/s/seq (-39%) while vLLM sustains ~5.9. The gap is the scheduler / host loop, not the kernel. This is the serving analogue of the prefill-GEMM regime split called out in PREFILL_GEMM_SCOPE.md.

Cross-links: README.md sections 2 (scheduler), 3 (patches 0008/0013/0016/0024/0025/0029), 5 (rejected levers - lever 2 graph coverage was FLAT in the static regime; this doc reopens it for the serving regime); .agents/llama-cpp-localai-paged-backend.md (bit-exact gate); .agents/vllm-parity-methodology.md (both-engine ground-truth, per-lever A/B, record-rejected-levers).

1. The two regimes, and why the kernel-parity result does not carry over

llama-batched-bench and a real serving workload exercise the same decode kernels but different host loops:

	`llama-batched-bench` (kernel regime)	`llama-server` continuous serving
batch shape per step	constant (fixed `npl`, lockstep)	churns (arrivals/completions, interleaved prefill)
participating seq-set	fixed for the whole run	changes as requests start/finish
graph reuse (see s.2)	holds after warmup -> 1 capture, replayed	breaks nearly every step -> rebuild + re-capture
measured	paged ~6.1 tok/s/seq ~ vLLM ~5.9	paged ~3.7 vs vLLM ~5.9 (-39%)

The README's decode parity, BW-floor, and "host loop is the irreducible residual" findings are all kernel-regime findings. They prove the kernels are not the serving gap. They do not prove the host loop is irreducible in serving - the static bench holds the batch shape constant, which is exactly the condition that lets both graph-reuse layers (section 2) stay hot. Serving violates that condition. So the serving gap is reopened here as a host / scheduler problem, orthogonal to the kernel.

2. Root-cause hypothesis (from source, pin `9d5d882d` + the dev tree)

There are two independent graph-reuse layers, and continuous batching breaks both on nearly every step. This is the leading hypothesis for the -39%.

2a. Layer A - llama-context graph reuse (`can_reuse` / `allow_reuse`)

llama_context::process_ubatch (src/llama-context.cpp ~L1366) only reuses the built ggml graph when res->can_reuse(gparams) holds. allow_reuse (src/llama-graph.h ~L631) requires, among others:

ubatch.n_tokens     == other.ubatch.n_tokens &&
ubatch.n_seqs       == other.ubatch.n_seqs   &&
ubatch.n_seqs_unq   == other.ubatch.n_seqs_unq &&
ubatch.equal_seqs() == other.ubatch.equal_seqs()
// + (when equal_seqs) the participating sequence-id SET must match

In serving, n_tokens changes whenever the decode load D changes or a prefill chunk is co-batched, and the sequence-id set changes whenever a request starts or finishes. Either makes can_reuse return false, so process_ubatch falls into the else branch: rebuild the graph (model.build_graph) + ggml_backend_sched_reset + ggml_backend_sched_alloc_graph - full host-side graph construction + allocation, every step. In batched-bench all sequences are lockstep so n_tokens/seq-set are constant and can_reuse is true after warmup (the graphs reused = N perf line is ~all steps).

2b. Layer B - CUDA graph capture (`ggml_cuda_graph_*`)

Even when layer A reuses, the CUDA backend re-checks ggml_cuda_graph_update_required (ggml-cuda.cu ~L3367): it memcmps every node's ne, nb, and src[]->data pointers against the captured graph. Any shape change -> cudaGraphExecUpdate / re-instantiate. Two serving-specific triggers:

shape churn (same root cause as layer A): different n_tokens -> different node ne -> update required.
paged data-pointer churn: when a co-batched prefill allocates new KV blocks (or a finished sequence frees them), the per-step KV view tensors' data pointers move, so even a constant-shape decode step can trip the memcmp. (The block-table contents live in a fixed device buffer filled by set_inputs, so the table tensor pointer itself is stable - 0029 keeps that cheap - but the K/V cache views are not.)

Net: under serving, the GPU sits idle between launches while the host rebuilds the graph (layer A) and re-instantiates the CUDA graph (layer B), then runs an un-graphed set_inputs (H2D input copies) before each launch. vLLM avoids this with padded/bucketed decode batch shapes + piecewise CUDA graphs: it pads the decode batch to a fixed set of sizes and captures one persistent graph per bucket, so the steady-state decode step is a single cudaGraphLaunch with no host rebuild. Its scheduler is also a tight C++ loop with chunked-prefill interleave that keeps the GPU fed.

2c. Per-step host work that runs un-graphed regardless (already instrumented)

The dev tree carries a built-in [L5INSTR] profiler (src/paged-attn.cpp, hooks in src/llama-context.cpp and src/llama-kv-cache.cpp) that already isolates the host buckets we care about, printed at process exit:

[L5INSTR] get_block_table n=.. sum=..ms mean=..ms | set_inputs n=.. mean=..ms | hostproc n=.. mean=..ms

hostproc = mctx->apply() + graph reuse-check/rebuild + set_inputs, i.e. the whole host window before graph_compute (it does NOT include the GPU launch). Prior profiles put this near ~1.4 ms/step.
set_inputs = the H2D input fills (positions, masks, block table, idxs).
get_block_table = the paged block-table host build (0029 caches it within-step; LLAMA_PAGED_NO_BT_CACHE A/B-toggles that).

If hostproc per step is a large fraction of the serving per-step wall time (and the graphs reused count is low), the gap is host-bound, not kernel-bound.

2d. The serial-SSM host loop (named in README s.5, secondary here)

The gated-DeltaNet decode advances recurrent state per step; sampling cannot start until logits land. The README already names this as a structural floor in the kernel regime. It is the same in serving but is the smaller term - the graph-rebuild/re-capture overhead (2a/2b) is the new, serving-specific cost the static bench hides, and it is the one to attack first.

3. What the already-shipped scheduler patches do (and do NOT do)

These exist; understand them before proposing anything. None of them touch the two graph-reuse layers - they target prefill freezing and burst collapse, not steady-state decode-step host overhead. That is why the serving gap survives them.

Patch	What it does	What it does NOT do
0008 cross-request prefix-share (server loop)	Concurrent shared-prefix requests prefill only the divergent suffix (fewer prefill tokens).	Does not stabilise decode batch shape; does not graph-reuse.
0013 `LLAMA_PREFILL_BUDGET`	Static per-step prefill-token cap (vLLM `--max-num-batched-tokens` analogue); flattens the ITL spike a long prefill inflicts on co-batched decode.	Ignores decode load; per-workload tuning; no effect on decode-step graph reuse.
0016 dynamic decode-first budget	`max(n_ubatch, T-D)` leftover-after-decode budget + per-slot chunk cap; decode claimed first, auto-shrinks as `D` rises. Stops a prefill chunk from inflating the step past `T`.	Still lets the per-step decode `n_tokens` and seq-set vary, so it does not make the decode step graph-reusable; it shapes prefill admission, not decode-shape stability.
0024 paged-pool burst-reclaim	Truncate/defrag/release KV blocks; fixes long-server prefill burst collapse (488->44->532 t/s).	Host accounting only; nothing about decode-step graph capture.
0025 `LLAMA_MOE_FORCE_GRAPHS`	Keeps CUDA graphs ON for the grouped-MMQ MoE decode step (lifts the conservative `MUL_MAT_ID` graph-disable).	Helps the CUDA-graph eligibility of one op; does not make layer-A/B reuse hold across churning steps. It is necessary-not-sufficient: a step that rebuilds anyway gets recaptured regardless.
0029 block-table within-step cache	`get_block_table` computed once per step, memcpy'd to other full-attn layers (-87/-91%).	Shrinks one `set_inputs`/`hostproc` sub-term; does not address rebuild/re-capture.

README s.5 "lever 2 (graph/stream coverage): FLAT" was concluded in the static batched-bench regime, where graphs already reuse - so more graph coverage was correctly a no-op there. That conclusion does not apply to the serving regime, where graphs do not reuse. This doc reopens graph coverage for serving only; record it as a regime-scoped reopening, not a contradiction.

4. Ranked lever plan (hypotheses - gate on Phase 0 first)

Ranked by value/effort with bit-exactness/risk called out. All are host-side / scheduler levers (no decode-kernel changes), so all are bit-exact-safe by construction provided padding tokens are masked-inert and verified against the per-path md5 gate.

Lever S1 (TOP) - bucketed/padded decode-step shape for graph reuse

Value: high (targets the dominant -39% mechanism). Effort: medium-high. Risk: medium (correctness of padding inertness; seq-set churn is harder than n_tokens).

Make the steady-state decode step present a stable, bucketed shape to both reuse layers, mirroring vLLM's padded decode batch + piecewise CUDA graphs:

Pad the per-step decode n_tokens (and the stream/seq count the graph sees) up to the next bucket in a small fixed set (e.g. {power-of-two or fixed grid}), so allow_reuse (layer A) and update_required (layer B) hold across steps with the same bucket. Padding tokens are dummy, masked positions that contribute nothing to any real sequence's logits.
Bound the number of distinct live buckets so a handful of persistent CUDA graphs cover steady decode (vLLM captures ~tens).
Handle the seq-set component of allow_reuse: bucketing n_tokens alone is insufficient because the participating sequence-id set must also match. Either (a) pad to a fixed stream-slot layout so the seq-set is stable across arrivals /completions, or (b) relax/extend the reuse key so a pure-decode step keyed on bucket+slot-layout reuses regardless of which slots are occupied. (b) is the higher-leverage but more invasive option.

Bit-exact gate: greedy md5 per path with padding ON must equal the recorded references (5951a5b4 dense, 8cb0ce23 paged-MoE); test-backend-ops unaffected (no op changes). The risk is that masked/padded positions leak into a real logit (off-by-one in the mask) - the md5 gate catches it.

Lever S2 - overlap per-step host work with GPU decode (double-buffer inputs)

Value: medium-high (recovers the hostproc window even when S1 partial). Effort: medium. Risk: low (host-side reordering only, bit-exact-safe).

Even with graphs reused, set_inputs (+ the pre-set_inputs sync) runs un-graphed and serially before each launch (hostproc ~1.4 ms/step in prior profiles). Overlap the host scheduling + input build of step N+1 with the GPU decode of step N: double-buffer the input device tensors so the host can fill N+1's inputs while N's graph is in flight, and prepare the next ubatch / block table on the host concurrently. This is the llama.cpp analogue of vLLM keeping the GPU fed. Strictly host-side, no numeric change -> bit-exact. (0029 already banks part of this for the block table within a step; S2 extends it across steps.)

Lever S3 - graph-shape-stable scheduling (bridge from 0016)

Value: medium (multiplies S1; low marginal value without S1). Effort: low-medium (extends the existing 0016 policy). Risk: low (scheduler policy, bit-exact when the decode result is unchanged).

Extend the existing decode-first budget (0016) so the scheduler actively prefers graph-reusable steps: keep prefill chunks out of the decode step (run prefill in its own steps, or at a fixed chunk size) so the decode batch shape stays on a bucket rather than being perturbed by interleaved prefill tokens every step. This is the policy half of S1 - S1 makes a bucketed step reusable; S3 makes the scheduler emit bucketed steps. Pair them.

Rejected/deferred (record so they are not re-tried):

More CUDA-graph coverage alone (the README lever-2 redo): still FLAT without S1. Forcing more ops graph-eligible (beyond 0025) does nothing while layer A rebuilds the graph every step - the recapture dominates. Only valuable after S1 makes reuse hold.
GGML_CUDA_DISABLE_GRAPHS / disabling graphs in serving: REJECTED a priori as a fix (it is an A/B probe for Phase 0, not a lever) - it removes capture cost but also removes replay benefit; expected net-negative.
Precision levers (W4A16, bf16-SSM): out of scope - this gap is host-bound, not GEMM/BW-bound (see README s.5 rejections; do not reopen).

5. Phase 0 - confirm it is host-bound BEFORE building (run when the GPU frees)

Do NOT build any lever until this confirms host-bound. The dev tree already has all the instrumentation; this is a measurement, not a code change. One GPU bencher at a time (GPU-contention rule).

Workload. Real continuous serving, not batched-bench: run llama-server (paged build) with the paged config and drive it with a steady concurrent streaming load (e.g. a K-client async generator hitting /completion with staggered arrivals so requests start/finish asynchronously - the regime batched-bench cannot produce). Use the same models/flags as README s.4: -fa on -ngl 99, LLAMA_KV_PAGED=1 (+ LLAMA_MOE_FORCE_GRAPHS=1 for MoE), dense Qwen3.6-27B-NVFP4 and MoE Qwen3.6-35B-A3B-NVFP4. Pick K so the effective decode width matches a static npl you have a kernel-regime number for (e.g. ~128) - that gives the apples comparison: static 6.1 vs serving 3.7 tok/s/seq.

Signals to capture (all already exist):

Graph reuse rate. The graphs reused = N perf line (llama-context.cpp ~L4146, from data.n_reused) over total decode steps. Hypothesis: ~100% in batched-bench, near 0% in serving. This is the single most decisive number. A/B with LLAMA_GRAPH_REUSE_DISABLE=1 (forces the rebuild path) - if serving is already near that floor, layer-A reuse is the gap.
[L5INSTR] host buckets (printed at exit): hostproc, set_inputs, get_block_table mean ms/step. Compare serving vs batched-bench. A/B the block-table cache with LLAMA_PAGED_NO_BT_CACHE.
GPU-busy % in a steady-state serving window via nsys (sum of kernel durations / wall) and the inter-launch host gap (time between consecutive cudaGraphLaunch/kernel launches). Hypothesis: batched-bench ~96-99% busy (README/methodology note the early "low util" was a window artifact); serving materially lower, with the gap ~= hostproc/step. Watch the same window artifact the methodology warns about - measure a clean steady-state span.
CUDA-graph re-instantiation count - confirm layer B is also re-capturing (nsys shows cudaGraphInstantiate/cudaGraphExecUpdate per step, or add a host-side counter print - host-side only, no kernel code).

Decision rule. Host-bound (proceed with S1/S2/S3) if: serving graphs reused is low AND hostproc/step is a large fraction of serving per-step wall AND GPU-busy% drops vs batched-bench by ~the observed throughput ratio (~3.7/6.1). If instead GPU-busy% stays high and per-kernel time grows, the cause is elsewhere (e.g. serving runs a worse effective batch shape into the kernels) - re-scope before building.

Ground-truth vLLM (both-engine rule). Capture vLLM at the same concurrency: GPU-busy% / step cadence (nsys) and its scheduler step time. Confirm vLLM stays GPU-bound (persistent graphs) where paged goes host-bound - that is the direct evidence the gap is the host loop, and it sizes the achievable win.

6. Summary

The serving gap (paged 3.7 vs vLLM 5.9 tok/s/seq, -39%) is a host/scheduler problem, distinct from the decode kernel (at parity in batched-bench). The README's BW-floor/host-loop-residual findings are kernel-regime and do not bound the serving regime.
Leading mechanism: continuous batching's batch-shape + seq-set churn breaks both graph-reuse layers (llama-context can_reuse, CUDA update_required) every step, so the GPU idles while the host rebuilds + re-captures + runs un-graphed set_inputs. vLLM avoids this with padded/bucketed decode shapes + piecewise CUDA graphs.
The shipped scheduler patches (0008/0013/0016/0024/0025/0029) target prefill freezing + burst collapse, not decode-step graph reuse - which is why the serving gap survives them.
Top levers (all host-side, bit-exact-safe): S1 bucketed/padded decode-step shape for graph reuse, S2 double-buffer/overlap per-step host work, S3 graph-shape-stable scheduling (extend 0016). Gate everything on Phase 0: the graphs reused rate + [L5INSTR] host buckets + nsys GPU-busy% in real llama-server serving vs batched-bench, with vLLM ground-truthed at the same concurrency.

24 KiB Raw Blame History