Commit Graph

3 Commits

Author SHA1 Message Date
Ettore Di Giacinto
b028c81eda docs(paged): record padded/fixed-slot decode shape as tested-and-rejected
The S1 section-(a) padded/fixed-slot decode shape (the scoped follow-up to push
serving graph reuse from ~72% toward ~100%) was implemented in an isolated
worktree off the committed S1/S3/tail base, built CUDA-only, and benched on GB10.
Verdict: REJECTED. It is bit-exact and provably inert, but it regresses serving
throughput at every concurrency and does not close the vLLM gap.

Implementation (default-off, LLAMA_PAGED_PAD_DECODE): on a pure-decode step
(n_prompt_budgeted == 0) emit a masked-inert dummy decode for every idle slot so
n_tokens / n_seqs / n_seqs_unq / n_outputs and the seq-id set stay constant; a
release()-side guard keeps a finished slot warm under padding. Each dummy is its
own sequence (private recurrent state, per-stream paged attention, logits
discarded), so it cannot perturb a real stream.

Gates: single-seq greedy md5 bit-exact (dense 5951a5b4, paged-MoE 8cb0ce23). The
literal per-stream ON-vs-OFF identity gate is unachievable - concurrent cuBLAS/FA
decode is not bit-reproducible run-to-run even with padding off (OFF-vs-OFF
diverging streams: dense 3/16, MoE 8/16). The achievable inertness gate passed:
ON-vs-OFF per-stream prefix-agreement equals the OFF-vs-OFF noise floor exactly
(MoE 0.940/0.940, dense 0.812/0.812), so the dummy slots leak nothing.

Bench (MoE Qwen3.6-35B-A3B-NVFP4, GB10), burst decode tok/s/seq: n=8 S1+S3 28.16
/ PAD 6.05 / vLLM 44.8; n=128 S1+S3 4.53 / PAD 4.32 / vLLM 6.87. Staggered
aggregate tok/s: baseline (reuse 0%) 757.6, S1+S3 (reuse 72%) 763.3, PAD
(reuse 38%) 558.0.

Why it fails: (1) serving decode here is GPU-compute-bound, not host-rebuild-bound
- baseline reuse 0% ~= S1+S3 reuse 72% on aggregate tok/s, so closing reuse buys
~nothing (the earlier 542->762 host-bound delta did not reproduce); (2) padding
adds dummy-row compute proportional to pad_width - real_load, catastrophic at low
load; (3) in continuous serving padding cannot hold a constant width (perpetual
prefill churn) so reuse drops 72% -> 38%; (4) the completion-driven batch shrink
padding prevents is itself a throughput win in a compute-bound regime. The
residual burst gap is GPU-compute, which a host-side reuse lever cannot close.

Patch series unchanged: this rejected lever is NOT added to patches/paged/.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-28 20:47:43 +00:00
Ettore Di Giacinto
d706980c2b feat(paged): close the continuous-serving decode gap (S1+S3, patches 0040/0041)
Add the two decode-serving graph-reuse levers (validated on GB10) that close the
host-bound serving gap (paged dropped to ~3.7 vs vLLM ~5.9 tok/s/seq in real
continuous serving while tying it in static batched-bench).

- 0040 S1 paged decode-graph reuse: the paged decode inputs never overrode
  llm_graph_input_i::can_reuse (defaults false), so the host rebuilt the ggml
  graph on EVERY decode step (layer-A reuse 0%). Add a 256-bucketed-shape
  can_reuse + a live-mctx refresh from the owning attn input. Bit-exact (md5
  byte-identical reuse on/off). Static batched-bench: paged reuse 0% -> 95.5%.
- 0041 S3 decode-shape-stable scheduling: keep co-batched prefill out of decode
  steps so the scheduler emits the reuse-stable pure-decode shape S1 can reuse.
  Default-off policy on top of 0016; bit-exact (per-stream independent).

S1+S3 together (128-client staggered serving, MoE Qwen3.6-35B-A3B-NVFP4): graph
reuse 0% -> 72.2%, hostproc 15.98 -> 6.31 ms/step, decode 4.05 -> 5.52 tok/s/seq
median (4.24 -> 5.96 mean, at vLLM's ~5.9). S1 alone is insufficient (13.8%);
S3 is the multiplier. S2 (double-buffer set_inputs) dropped: Phase-0 put
set_inputs at ~0.05 ms/step, so it has nothing to recover. README patch table +
DECODE_SERVING_SCOPE.md updated with results and the padded/fixed-slot follow-up.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-28 18:04:28 +00:00
Ettore Di Giacinto
9a28f23134 docs(paged): scope the continuous-serving decode gap (host-bound, design-only)
Add DECODE_SERVING_SCOPE.md: the decode KERNEL is at parity in static
batched-bench (~6.1 tok/s/seq ~ vLLM ~5.9 at npl128) but continuous serving
through llama-server update_slots() drops to ~3.7 (-39%) while vLLM sustains
~5.9. Scope shows the gap is the scheduler/host loop, not the kernel.

Root-cause hypothesis from source: continuous batching's batch-shape + seq-set
churn breaks BOTH graph-reuse layers every step - llama-context can_reuse/
allow_reuse (n_tokens + seq-set must match) and the CUDA ggml_cuda_graph
update_required memcmp (ne/nb/data ptrs) - so the GPU idles while the host
rebuilds + re-captures the graph and runs un-graphed set_inputs. vLLM avoids
this with padded/bucketed decode shapes + piecewise CUDA graphs. Documents that
the shipped scheduler patches (0008/0013/0016/0024/0025/0029) target prefill
freezing + burst collapse, NOT decode-step graph reuse, which is why the serving
gap survives them; notes the README s.5 'lever 2 graph coverage FLAT' verdict was
static-regime and is reopened here for serving only.

Ranks host-side, bit-exact-safe levers: S1 bucketed/padded decode-step shape for
graph reuse, S2 double-buffer/overlap per-step host work, S3 graph-shape-stable
scheduling (extend 0016). Specifies a Phase-0 profile to confirm host-bound
before any build, reusing the in-tree [L5INSTR] hostproc/set_inputs/
get_block_table timers, the 'graphs reused' perf counter, LLAMA_GRAPH_REUSE_DISABLE
and nsys GPU-busy%, with vLLM ground-truthed at the same concurrency. No kernel
code; no GPU run in this pass.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-28 17:14:51 +00:00