Files
LocalAI/backend/cpp/llama-cpp-localai-paged/docs
Ettore Di Giacinto 9a28f23134 docs(paged): scope the continuous-serving decode gap (host-bound, design-only)
Add DECODE_SERVING_SCOPE.md: the decode KERNEL is at parity in static
batched-bench (~6.1 tok/s/seq ~ vLLM ~5.9 at npl128) but continuous serving
through llama-server update_slots() drops to ~3.7 (-39%) while vLLM sustains
~5.9. Scope shows the gap is the scheduler/host loop, not the kernel.

Root-cause hypothesis from source: continuous batching's batch-shape + seq-set
churn breaks BOTH graph-reuse layers every step - llama-context can_reuse/
allow_reuse (n_tokens + seq-set must match) and the CUDA ggml_cuda_graph
update_required memcmp (ne/nb/data ptrs) - so the GPU idles while the host
rebuilds + re-captures the graph and runs un-graphed set_inputs. vLLM avoids
this with padded/bucketed decode shapes + piecewise CUDA graphs. Documents that
the shipped scheduler patches (0008/0013/0016/0024/0025/0029) target prefill
freezing + burst collapse, NOT decode-step graph reuse, which is why the serving
gap survives them; notes the README s.5 'lever 2 graph coverage FLAT' verdict was
static-regime and is reopened here for serving only.

Ranks host-side, bit-exact-safe levers: S1 bucketed/padded decode-step shape for
graph reuse, S2 double-buffer/overlap per-step host work, S3 graph-shape-stable
scheduling (extend 0016). Specifies a Phase-0 profile to confirm host-bound
before any build, reusing the in-tree [L5INSTR] hostproc/set_inputs/
get_block_table timers, the 'graphs reused' perf counter, LLAMA_GRAPH_REUSE_DISABLE
and nsys GPU-busy%, with vLLM ground-truthed at the same concurrency. No kernel
code; no GPU run in this pass.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-28 17:14:51 +00:00
..