mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-29 19:06:43 -04:00
feat(paged): close the continuous-serving decode gap (S1+S3, patches 0040/0041)
Add the two decode-serving graph-reuse levers (validated on GB10) that close the
host-bound serving gap (paged dropped to ~3.7 vs vLLM ~5.9 tok/s/seq in real
continuous serving while tying it in static batched-bench).

- 0040 S1 paged decode-graph reuse: the paged decode inputs never overrode
  llm_graph_input_i::can_reuse (defaults false), so the host rebuilt the ggml
  graph on EVERY decode step (layer-A reuse 0%). Add a 256-bucketed-shape
  can_reuse + a live-mctx refresh from the owning attn input. Bit-exact (md5
  byte-identical reuse on/off). Static batched-bench: paged reuse 0% -> 95.5%.
- 0041 S3 decode-shape-stable scheduling: keep co-batched prefill out of decode
  steps so the scheduler emits the reuse-stable pure-decode shape S1 can reuse.
  Default-off policy on top of 0016; bit-exact (per-stream independent).

S1+S3 together (128-client staggered serving, MoE Qwen3.6-35B-A3B-NVFP4): graph
reuse 0% -> 72.2%, hostproc 15.98 -> 6.31 ms/step, decode 4.05 -> 5.52 tok/s/seq
median (4.24 -> 5.96 mean, at vLLM's ~5.9). S1 alone is insufficient (13.8%);
S3 is the multiplier.

S2 (double-buffer set_inputs) dropped: Phase-0 put set_inputs at ~0.05 ms/step,
so it has nothing to recover.

README patch table + DECODE_SERVING_SCOPE.md updated with results and the
padded/fixed-slot follow-up.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@@ -86,9 +86,10 @@ orthogonal to the paged allocator.

---

## 3. Patch series (0001-0031)
## 3. Patch series (0001-0041)

29 patches (0005 and 0027 are intentionally unused). "Bit-exact" = greedy md5 /
Source-only patches, with intentional numbering gaps (e.g. 0005, 0027). The
decode-serving graph-reuse levers are 0040-0041. "Bit-exact" = greedy md5 /
`test-backend-ops` byte-identical to the relevant baseline; the gate methodology
is in section 5.

@@ -122,6 +123,33 @@ bit-exact. 0017 is the dense FP4-GEMM occupancy-tune track: bit-exact gate green
but every cheap occupancy lever regressed on GB10, so nothing is enabled - it
ships as the parity gate + default-off instrumentation only.)

### Decode-serving graph reuse (0040, 0041)

These two close the **continuous-serving** decode gap (distinct from the static
batched-bench decode kernel, which is already at vLLM parity - see
[`docs/DECODE_SERVING_SCOPE.md`](docs/DECODE_SERVING_SCOPE.md)). In serving the
host rebuilt the ggml graph on **every** decode step (layer-A graph reuse was 0%),
so the GPU idled while the host rebuilt - the host-bound -39% the static bench
hides.

| # | What it does | Bit-exact |
|---|---|---|
| 0040 | **S1 paged decode-graph reuse** - the paged decode inputs (`input_block_table` / `input_gather_idxs`) never overrode `can_reuse` (defaults to false), so any graph carrying a paged input could never be reused. Add a correct `can_reuse` keyed on the (256-bucketed) block-table dims + a live-mctx refresh from the owning attn input. `LLAMA_PAGED_NO_GRAPH_REUSE=1` forces the pre-S1 path. | yes (md5 byte-identical reuse on/off; dense `5951a5b4`, paged-MoE `8cb0ce23`) |
| 0041 | **S3 decode-shape-stable scheduling** - keep co-batched prefill OUT of decode steps so the pure-decode batch shape stays reuse-stable (S1 makes a pure-decode step reusable; S3 makes the scheduler emit them). Pure `update_slots()` policy on top of 0016; prefill admitted on a bounded cadence (`LLAMA_PAGED_PREFILL_PERIOD`, default 8). `LLAMA_PAGED_DECODE_STABLE=1` to enable. | yes (default-off byte-identical; per-stream independent in serving) |
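
As a rough illustration of why the 256-bucketed shape in 0040 stays reusable, the sketch below recomputes the padded view length the way the patch describes it (pad the gathered-cell count to 256, then clamp to the KV window). The names `pad_up`, `block_table_n_view` and `shape_unchanged` are hypothetical stand-ins for this example, not the patch's own functions, and the values passed in are arbitrary.

```cpp
#include <cstdint>

// Hypothetical helper: round x up to the next multiple of n (n must be > 0).
static int64_t pad_up(int64_t x, int64_t n) {
    return ((x + n - 1) / n) * n;
}

// Sketch of the bucketed block-table view length: pad the number of gathered
// cells to a 256 bucket, then clamp to the physical KV window n_kv.
static int64_t block_table_n_view(int64_t n_gather, int64_t n_kv) {
    if (n_gather <= 0) {
        return 0; // nothing to gather, so there is no reusable shape
    }
    const int64_t n_view = pad_up(n_gather, 256);
    return n_view > n_kv ? n_kv : n_view;
}

// A graph built for [n_view_built, n_stream_built] stays reusable while the
// recomputed bucket and the stream count both match what it was built with.
static bool shape_unchanged(int64_t n_view_built, int64_t n_stream_built,
                            int64_t n_gather, int64_t n_kv, int64_t n_stream) {
    const int64_t n_view = block_table_n_view(n_gather, n_kv);
    return n_view > 0 && n_view == n_view_built && n_stream == n_stream_built;
}
```

Under these assumptions, every `n_gather` from 257 through 512 maps to the same 512-wide view, so a steady decode can keep one graph shape for up to 256 consecutive steps as long as the stream count does not change.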

Measured (GB10, MoE Qwen3.6-35B-A3B-NVFP4, 128-client staggered streaming load):
graph reuse **0% -> 72.2%**, host window `hostproc` **15.98 -> 6.31 ms/step**,
decode **4.05 -> 5.52 tok/s/seq median (4.24 -> 5.96 mean, at vLLM's ~5.9
sustained)**. S1 is necessary but **not** sufficient alone (13.8% reuse - prefill
co-batching churns the shape nearly every step); S3 is the multiplier, so they
ship and are measured together. The static batched-bench A/B isolates the S1
mechanism: paged decode reuse 0% -> 95.5% (throughput flat there, since the static
regime is GPU-bound). **S2 (double-buffer `set_inputs`) was dropped**: the Phase-0
profile put `set_inputs` at ~0.05 ms/step (the cost is the rebuild, not the input
copy), so it has nothing to recover. The remaining ~28% serving rebuilds are
request-boundary D/seq-set churn + the prefill-cadence steps; a padded/fixed-slot
decode shape to capture them is scoped in `docs/DECODE_SERVING_SCOPE.md`.
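
The prefill cadence that S3 applies can be sketched as a small counter. This is an illustration of the policy described for patch 0041, not the server code itself: the `PrefillCadence` struct and `period_from_env` helper are hypothetical names, while the `LLAMA_PAGED_PREFILL_PERIOD` variable and its default of 8 come from the patch description.

```cpp
#include <cstdlib>

// Hypothetical wrapper around the cadence described for patch 0041.
struct PrefillCadence {
    int  period;    // admit prefill once every `period` decode-carrying steps
    long step = 0;  // counts only steps that carry live decode load

    explicit PrefillCadence(int p) : period(p > 0 ? p : 8) {}

    // Returns true when this step should stay pure decode, i.e. prompt
    // admission is skipped so the decode batch shape does not change.
    bool decode_only(int n_decode_in_batch) {
        if (n_decode_in_batch <= 0) {
            return false; // no live decode: prefill may be admitted freely
        }
        const bool prefill_due = (step % period) == 0;
        step++;
        return !prefill_due;
    }
};

// Reads the period from the environment; falls back to 8 when the variable
// is unset or does not parse to a positive number.
static int period_from_env() {
    const char * e = std::getenv("LLAMA_PAGED_PREFILL_PERIOD");
    const int p = e ? std::atoi(e) : 8;
    return p > 0 ? p : 8;
}
```

With the default period of 8, seven of every eight decode-carrying steps skip prompt admission in this sketch, which is the bounded delay the patch trades for a stable decode shape.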

### SSM (gated-DeltaNet) decode levers (0018-0022, 0028)

These are the dominant decode levers on the Qwen3.6 hybrid models. All bit-exact.

@@ -1,7 +1,39 @@
# DECODE_SERVING_SCOPE - the continuous-serving decode gap (design only)
# DECODE_SERVING_SCOPE - the continuous-serving decode gap

**Status: DESIGN + SCOPE + RANKED LEVER PLAN ONLY. No kernel written, no GPU
run in this pass (the GPU was busy with prefill agents).** Per the
**Status: S1 + S3 IMPLEMENTED, GPU-validated, bit-exact, shipped as patches
0040 (S1) + 0041 (S3). S2 DROPPED (measured non-target). See the results block
below; the rest of this doc is the design/rationale those patches implement.**

## Results (GB10, measured)

Phase 0 confirmed host-bound: serving graph reuse **0% over ~5k steps** (layer-A
rebuilds every step), `hostproc` 3.44 ms/step vs 1.59 static - the +1.85 ms IS the
graph rebuild; `set_inputs` 0.047 ms and block-table 0.002 ms are negligible.
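
The reuse percentages quoted in these results are a ratio of reused steps to all graph-processing steps. A minimal sketch of such a tally follows; the `ReuseStats` struct is a hypothetical name for illustration, and only the counting idea and the `graph_reuse` output format are taken from the patch 0040 instrumentation.

```cpp
#include <cstdio>

// Hypothetical run-wide tally of graph reuse.
struct ReuseStats {
    long n_proc   = 0; // graph-processing steps seen
    long n_reused = 0; // steps that reused the previously built graph

    void add(bool reused) {
        n_proc++;
        if (reused) {
            n_reused++;
        }
    }

    // Reuse rate in percent; 0.0 when no step has been recorded yet.
    double rate_pct() const {
        return n_proc ? 100.0 * (double) n_reused / (double) n_proc : 0.0;
    }

    // Prints a one-line summary to stderr.
    void print() const {
        std::fprintf(stderr, "graph_reuse %ld/%ld = %.1f%%\n", n_reused, n_proc, rate_pct());
    }
};
```

Calling `add(true)` on a reused step and `add(false)` on a rebuilt one is enough to reproduce the kind of figure reported here; the guard in `rate_pct` avoids a division by zero when the process exits before any step ran.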

- **S1 (patch 0040)** - root cause: the paged decode inputs never overrode
`can_reuse` (defaults false), so the graph could never be reused. Fixed with a
256-bucketed-shape `can_reuse` + live-mctx refresh. Static batched-bench A/B:
paged decode reuse **0% -> 95.5%**, bit-exact (md5 byte-identical reuse on/off).
Necessary but **not** sufficient in serving (13.8% reuse alone - prefill
co-batching churns the shape).
- **S3 (patch 0041)** - keeps prefill out of decode steps so the scheduler emits
reuse-stable pure-decode steps. **S1+S3 together (128-client staggered serving,
MoE Qwen3.6-35B-A3B-NVFP4): reuse 0% -> 72.2%, `hostproc` 15.98 -> 6.31 ms/step,
decode 4.05 -> 5.52 tok/s/seq median (4.24 -> 5.96 mean, at vLLM's ~5.9).**
- **S2 (double-buffer set_inputs) - DROPPED.** Phase 0 put `set_inputs` at
~0.05 ms/step: it is not the cost (the rebuild is), so S2 has nothing to recover.
- **Follow-up to ~100% reuse:** the remaining ~28% serving rebuilds are
request-boundary D/seq-set churn + the S3 prefill-cadence steps. Capturing them
needs a **padded/fixed-slot decode shape** (pad the decode width to a fixed
bucket with masked-inert dummy slots so `n_tokens` and the seq-id set stay
constant across arrivals/completions - the lever S1 section (a) describes).
Deferred: S1+S3 already reach vLLM-parity on the mean; padding is server-side,
invasive, and not exercised by the single-sequence md5 gate (needs a per-stream
serving-determinism gate). It is the next lever, not a shipped one.
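
To make the deferred padded/fixed-slot idea concrete, here is a sketch of only the width arithmetic: round the number of active decode sequences up to a fixed bucket and mark the extra slots as dummies. It is not part of patches 0040/0041 and not the scoped design; `PaddedDecode`, `pad_decode_width` and the bucket value are hypothetical, and how dummy slots would be masked inside the model is left open.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result of padding a decode step to a fixed width.
struct PaddedDecode {
    int64_t           width; // padded number of decode slots
    std::vector<bool> live;  // live[i] is true for real sequences, false for dummy slots
};

// Rounds n_active up to the next multiple of bucket and flags which slots
// hold real sequences. Returns an empty result for non-positive inputs.
static PaddedDecode pad_decode_width(int64_t n_active, int64_t bucket) {
    PaddedDecode out{0, {}};
    if (n_active <= 0 || bucket <= 0) {
        return out;
    }
    out.width = ((n_active + bucket - 1) / bucket) * bucket;
    out.live.assign((std::size_t) out.width, false);
    for (int64_t i = 0; i < n_active; ++i) {
        out.live[(std::size_t) i] = true;
    }
    return out;
}
```

In this sketch, a bucket of 8 keeps the step width at 8 whether 5, 6 or 8 sequences are live, so arrivals and completions inside one bucket would not change the token count that the reuse check compares.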

---

Per the
"profile-don't-assume" rule in
[`.agents/vllm-parity-methodology.md`](../../../../.agents/vllm-parity-methodology.md),
**Phase 0 (section 5) is to confirm the bottleneck on GPU before touching any

@@ -0,0 +1,322 @@
From b81fa71360c3f6b46e97c6ad504efc10bdaea484 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Sun, 28 Jun 2026 20:00:04 +0200
Subject: [PATCH 40/41] feat(paged): S1 paged decode-graph reuse across serving
steps (patch 0040)

The continuous-serving decode gap (paged ~3.7 vs vLLM ~5.9 tok/s/seq) is
host-bound: llama-context layer-A graph reuse was 0% in serving, so the host
rebuilt the ggml graph EVERY decode step (the +1.85 ms/step the Phase-0 profile
attributes to the rebuild; set_inputs/block-table are negligible). Root cause:
the paged decode inputs (input_block_table / input_gather_idxs in paged-attn.cpp)
never overrode llm_graph_input_i::can_reuse, which defaults to false - so any
graph carrying a paged input could never be reused, even with a constant batch
shape. (This is also why the paged decode graph rebuilt in static batched-bench.)

S1 gives the paged inputs a correct can_reuse:
- reuse iff the input tensor dims are unchanged. The block table is
[n_view, n_stream] with n_view = PAD(n_gather, 256) clamped to n_kv, so it is
bucketed to 256 and stays constant across a 256-token decode window; n_stream
follows n_seqs_unq. The index CONTENTS are refilled at set_input on every step
(incl. reused steps), so a reused graph reads the current step's cells.
- the stored kv-cache context is refreshed from the owning attn input
(llm_graph_input_attn_kv, whose mctx is updated per-decode by attn_kv /
mem_hybrid can_reuse earlier in the input list), so a reused graph picks up the
live memory context. mem_hybrid::can_reuse now also refreshes inp_attn->mctx.

Master switch paged_attn::decode_graph_reuse() (ON by default when paged;
LLAMA_PAGED_NO_GRAPH_REUSE=1 forces the pre-S1 rebuild-every-step path for A/B).
Also surfaces the run-wide graph-reuse rate in the [L5INSTR] exit line
(l5_add_proc) since llama-server does not print llama_perf.

BIT-EXACT: greedy md5 byte-identical with reuse ON vs OFF on every path -
dense 5951a5b4d624ce891e22ab5fca9bc439, paged-MoE 8cb0ce23777bf55f92f63d0292c756b0.
Reuse only skips the host-side rebuild; set_inputs still re-runs every step.

Measured (GB10): batched-bench paged decode graph reuse 0% -> 95.5% (hostproc
dense 3.31->2.66, MoE 2.44->1.82 ms/step); static throughput flat as expected
(static regime is GPU-bound). The serving payoff needs S3 (patch 0041): S1 alone
holds only 13.8% reuse in serving because co-batched prefill churns the shape
every step.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
src/llama-context.cpp | 3 ++
src/llama-graph.cpp | 13 +++++-
src/paged-attn.cpp | 94 ++++++++++++++++++++++++++++++++++++++-----
src/paged-attn.h | 14 +++++++
4 files changed, 112 insertions(+), 12 deletions(-)

diff --git a/src/llama-context.cpp b/src/llama-context.cpp
index c408eef..306a506 100644
--- a/src/llama-context.cpp
+++ b/src/llama-context.cpp
@@ -1347,6 +1347,7 @@ bool llama_context::set_adapter_cvec(

extern "C" void l5_add_setinp(double ns);
extern "C" void l5_add_hostproc(double ns);
+extern "C" void l5_add_proc(int reused); // [S1] per-step graph-reuse counter
static inline double l5c_now_ns(){ struct timespec ts; clock_gettime(CLOCK_MONOTONIC,&ts); return (double)ts.tv_sec*1e9+(double)ts.tv_nsec; }
llm_graph_result * llama_context::process_ubatch(const llama_ubatch & ubatch, llm_graph_type gtype, llama_memory_context_i * mctx, ggml_status & ret) {
double _l5_t0=l5c_now_ns();
@@ -1374,7 +1375,9 @@ llm_graph_result * llama_context::process_ubatch(const llama_ubatch & ubatch, ll
}

n_reused++;
+ l5_add_proc(1);
} else {
+ l5_add_proc(0);
res->reset();

ggml_backend_sched_reset(sched.get());
diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp
index 931258d..0337742 100644
--- a/src/llama-graph.cpp
+++ b/src/llama-graph.cpp
@@ -699,6 +699,12 @@ bool llm_graph_input_mem_hybrid::can_reuse(const llm_graph_params & params) {

this->mctx = mctx;

+ // [S1] refresh the attn sub-input's memory context so paged decode inputs
+ // (which read owner->mctx in their can_reuse, run later in the input list)
+ // pick up the live per-decode context on a reused graph. Harmless for the
+ // non-paged path: inp_attn->mctx is only consumed at graph-build time there.
+ inp_attn->mctx = mctx->get_attn();
+
bool res = true;

res &= inp_attn->self_k_idxs->ne[0] == params.ubatch.n_tokens;
@@ -2370,8 +2376,11 @@ ggml_tensor * llm_graph_context::build_attn(
ggml_tensor * kq_mask_g = kq_mask;
ggml_tensor * block_table = nullptr;
const bool is_decode = (q_cur->ne[2] == k->ne[3]); // 1 query token per stream
- if (!(is_decode && paged_attn::in_kernel_decode(ctx0, res, mctx_cur, &k, &v, &kq_mask_g, &block_table))) {
- paged_attn::gather(ctx0, res, mctx_cur, &k, &v, &kq_mask_g);
+ // [S1] pass `inp` (the attn input) as the reuse owner: its mctx is refreshed
+ // per-decode by attn_kv/mem_hybrid can_reuse, and the paged inputs read it so
+ // a reused graph picks up the live memory context.
+ if (!(is_decode && paged_attn::in_kernel_decode(ctx0, res, mctx_cur, inp, &k, &v, &kq_mask_g, &block_table))) {
+ paged_attn::gather(ctx0, res, mctx_cur, inp, &k, &v, &kq_mask_g);
}

ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask_g, sinks, v_mla, kq_scale, il, block_table);
diff --git a/src/paged-attn.cpp b/src/paged-attn.cpp
index ebd92be..d543c7f 100644
--- a/src/paged-attn.cpp
+++ b/src/paged-attn.cpp
@@ -11,9 +11,13 @@
#include <ctime>
namespace { static inline double l5_now_ns(){ struct timespec ts; clock_gettime(CLOCK_MONOTONIC,&ts); return (double)ts.tv_sec*1e9+(double)ts.tv_nsec; } }
double g_l5_t_gbt=0, g_l5_t_setinp=0, g_l5_t_hostproc=0; long g_l5_n_gbt=0, g_l5_n_setinp=0, g_l5_n_hostproc=0;
+// [S1] graph-reuse counters across the whole run (the serving reuse-rate signal -
+// llama-server does not print llama_perf, so surface it here at process exit).
+long g_l5_n_proc=0, g_l5_n_reused=0;
extern "C" void l5_add_setinp(double ns){ g_l5_t_setinp+=ns; g_l5_n_setinp++; }
extern "C" void l5_add_hostproc(double ns){ g_l5_t_hostproc+=ns; g_l5_n_hostproc++; }
-namespace { struct L5Printer { ~L5Printer(){ fprintf(stderr,"[L5INSTR] get_block_table n=%ld sum=%.2fms mean=%.4fms | set_inputs n=%ld sum=%.2fms mean=%.4fms | hostproc n=%ld sum=%.2fms mean=%.4fms\n", g_l5_n_gbt, g_l5_t_gbt/1e6, g_l5_n_gbt? g_l5_t_gbt/1e6/g_l5_n_gbt:0.0, g_l5_n_setinp, g_l5_t_setinp/1e6, g_l5_n_setinp? g_l5_t_setinp/1e6/g_l5_n_setinp:0.0, g_l5_n_hostproc, g_l5_t_hostproc/1e6, g_l5_n_hostproc? g_l5_t_hostproc/1e6/g_l5_n_hostproc:0.0 ); } } g_l5_printer; }
+extern "C" void l5_add_proc(int reused){ g_l5_n_proc++; if (reused) g_l5_n_reused++; }
+namespace { struct L5Printer { ~L5Printer(){ fprintf(stderr,"[L5INSTR] get_block_table n=%ld sum=%.2fms mean=%.4fms | set_inputs n=%ld sum=%.2fms mean=%.4fms | hostproc n=%ld sum=%.2fms mean=%.4fms | graph_reuse %ld/%ld = %.1f%%\n", g_l5_n_gbt, g_l5_t_gbt/1e6, g_l5_n_gbt? g_l5_t_gbt/1e6/g_l5_n_gbt:0.0, g_l5_n_setinp, g_l5_t_setinp/1e6, g_l5_n_setinp? g_l5_t_setinp/1e6/g_l5_n_setinp:0.0, g_l5_n_hostproc, g_l5_t_hostproc/1e6, g_l5_n_hostproc? g_l5_t_hostproc/1e6/g_l5_n_hostproc:0.0, g_l5_n_reused, g_l5_n_proc, g_l5_n_proc? 100.0*g_l5_n_reused/g_l5_n_proc:0.0 ); } } g_l5_printer; }


namespace paged_attn {
@@ -28,17 +32,52 @@ static bool debug() {
return d;
}

+// [S1] paged decode-graph reuse master switch. ON by default whenever paging is
+// active; LLAMA_PAGED_NO_GRAPH_REUSE=1 forces it off (A/B probe / safety hatch).
+bool decode_graph_reuse() {
+ static const bool on = active() && (std::getenv("LLAMA_PAGED_NO_GRAPH_REUSE") == nullptr);
+ return on;
+}
+
namespace {

+// [S1] Recompute the block-table view length the SAME way in_kernel_decode()
+// builds it, so can_reuse() can compare against the stored tensor dim. n_view is
+// PAD(n_gather,256) clamped to the physical window n_kv: it only changes when
+// n_gather crosses a 256 boundary, so a steady decode reuses across many steps.
+static inline int64_t paged_block_table_n_view(const llama_kv_cache_context * mctx) {
+ const int64_t n_gather = (int64_t) mctx->get_n_gather();
+ if (n_gather <= 0) {
+ return 0;
+ }
+ int64_t n_view = GGML_PAD(n_gather, 256);
+ const int64_t n_kv = (int64_t) mctx->get_n_kv();
+ if (n_view > n_kv) {
+ n_view = n_kv;
+ }
+ return n_view;
+}
+
+// [S1] Number of attention streams the paged inputs build over - matches K->ne[3]
+// at build time and the n_stream used by can_reuse_kq_mask in llama-graph.cpp.
+static inline int64_t paged_n_stream(const llm_graph_params & params) {
+ return params.cparams.kv_unified ? 1 : (int64_t) params.ubatch.n_seqs_unq;
+}
+
// Graph input that, at set_input time, fills an I32 [n_gather, n_stream] tensor
// with each stream's non-empty cell indices (position-sorted, padded with a
-// masked/empty cell) by delegating to the kv-cache context. Private to this
-// unit; default can_reuse()==false keeps the graph from being reused across
-// decodes (n_gather grows every step).
+// masked/empty cell) by delegating to the kv-cache context. Private to this unit.
+//
+// [S1] can_reuse: the graph topology depends only on the tensor SHAPE
+// [n_gather, n_stream] - the index CONTENTS are refilled at set_input every step,
+// so they need not match. n_gather is UNPADDED here (the gather path is used for
+// prefill / transposed-V fallback), so it grows every decode and reuse rarely
+// holds - correct and harmless. mctx is refreshed from the owning attn input
+// (whose mctx is updated by attn_kv/mem_hybrid can_reuse earlier in the input list).
class input_gather_idxs : public llm_graph_input_i {
public:
- input_gather_idxs(const llama_kv_cache_context * mctx, ggml_tensor * idxs)
- : mctx(mctx), idxs(idxs) {}
+ input_gather_idxs(const llama_kv_cache_context * mctx, const llm_graph_input_attn_kv * owner, ggml_tensor * idxs)
+ : mctx(mctx), owner(owner), idxs(idxs) {}

void set_input(const llama_ubatch * ubatch) override {
GGML_UNUSED(ubatch);
@@ -46,17 +85,37 @@ public:
mctx->get_gather_idxs((int32_t *) idxs->data);
}

+ bool can_reuse(const llm_graph_params & params) override {
+ if (!owner || !paged_attn::decode_graph_reuse()) {
+ return false;
+ }
+ mctx = owner->mctx; // refresh to the live per-decode context
+ const int64_t n_gather = (int64_t) mctx->get_n_gather();
+ if (n_gather <= 0) {
+ return false;
+ }
+ return idxs->ne[0] == n_gather && idxs->ne[1] == paged_n_stream(params);
+ }
+
const llama_kv_cache_context * mctx;
+ const llm_graph_input_attn_kv * owner;
ggml_tensor * idxs;
};

// Block table filler for the in-kernel paged read: fills an I32 [n_blk, n_stream]
// tensor with each stream's position-ordered cells, padded to n_blk (per column)
// with a masked empty cell, by delegating to the kv-cache context.
+//
+// [S1] can_reuse: reuse iff the block-table tensor dims [n_view, n_stream] are
+// unchanged - n_view is bucketed to 256 (paged_block_table_n_view), so the decode
+// graph reuses across every step within a 256-token window. The table CONTENTS
+// are refilled at set_input on every step (incl. reused steps), so the reused
+// graph reads the current step's cells. mctx is refreshed from the owning attn
+// input so the reused graph's set_input/get_block_table uses the live context.
class input_block_table : public llm_graph_input_i {
public:
- input_block_table(const llama_kv_cache_context * mctx, ggml_tensor * idxs, uint32_t n_blk)
- : mctx(mctx), idxs(idxs), n_blk(n_blk) {}
+ input_block_table(const llama_kv_cache_context * mctx, const llm_graph_input_attn_kv * owner, ggml_tensor * idxs, uint32_t n_blk)
+ : mctx(mctx), owner(owner), idxs(idxs), n_blk(n_blk) {}

void set_input(const llama_ubatch * ubatch) override {
GGML_UNUSED(ubatch);
@@ -66,7 +125,20 @@ public:
g_l5_t_gbt += l5_now_ns()-_t; g_l5_n_gbt++;
}

+ bool can_reuse(const llm_graph_params & params) override {
+ if (!owner || !paged_attn::decode_graph_reuse()) {
+ return false;
+ }
+ mctx = owner->mctx; // refresh to the live per-decode context
+ const int64_t n_view = paged_block_table_n_view(mctx);
+ if (n_view <= 0 || n_view != (int64_t) n_blk) {
+ return false;
+ }
+ return idxs->ne[0] == n_view && idxs->ne[1] == paged_n_stream(params);
+ }
+
const llama_kv_cache_context * mctx;
+ const llm_graph_input_attn_kv * owner;
ggml_tensor * idxs;
uint32_t n_blk;
};
@@ -76,6 +148,7 @@ public:
void gather(ggml_context * ctx0,
llm_graph_result * res,
const llama_kv_cache_context * mctx,
+ const llm_graph_input_attn_kv * owner,
ggml_tensor ** k,
ggml_tensor ** v,
ggml_tensor ** kq_mask) {
@@ -114,7 +187,7 @@ void gather(ggml_context * ctx0,
// n_stream, so column s gathers from stream s of the source.
ggml_tensor * idx = ggml_new_tensor_2d(ctx0, GGML_TYPE_I32, n_gather, n_stream);
ggml_set_input(idx);
- res->add_input(llm_graph_input_ptr(new input_gather_idxs(mctx, idx)));
+ res->add_input(llm_graph_input_ptr(new input_gather_idxs(mctx, owner, idx)));

// --- gather K: collapse (head_dim, n_head) so cells become the row axis ---
{
@@ -156,6 +229,7 @@ void gather(ggml_context * ctx0,
bool in_kernel_decode(ggml_context * ctx0,
llm_graph_result * res,
const llama_kv_cache_context * mctx,
+ const llm_graph_input_attn_kv * owner,
ggml_tensor ** k,
ggml_tensor ** v,
ggml_tensor ** kq_mask,
@@ -221,7 +295,7 @@ bool in_kernel_decode(ggml_context * ctx0,

ggml_tensor * idx = ggml_new_tensor_2d(ctx0, GGML_TYPE_I32, n_view, n_stream);
ggml_set_input(idx);
- res->add_input(llm_graph_input_ptr(new input_block_table(mctx, idx, (uint32_t) n_view)));
+ res->add_input(llm_graph_input_ptr(new input_block_table(mctx, owner, idx, (uint32_t) n_view)));

// Present K and V as [d, h, n_view, ns] VIEWS of the full physical window:
// identical per-cell (nb1,nb2) and per-stream (nb3) strides, only the cell
diff --git a/src/paged-attn.h b/src/paged-attn.h
index 23e2184..fafe821 100644
--- a/src/paged-attn.h
+++ b/src/paged-attn.h
@@ -21,18 +21,31 @@ struct ggml_context;
struct ggml_tensor;
class llm_graph_result;
class llama_kv_cache_context;
+class llm_graph_input_attn_kv;

namespace paged_attn {

// true iff env LLAMA_KV_PAGED is set (evaluated once).
bool active();

+// [S1] true iff the paged decode-graph reuse (layer-A can_reuse on the paged
+// inputs) is ENABLED. Default ON when active(); LLAMA_PAGED_NO_GRAPH_REUSE=1
+// forces it off (A/B probe / safety hatch). When off the paged inputs keep the
+// stock default can_reuse()==false, i.e. the pre-S1 behaviour (rebuild every
+// step). Bit-exact either way - reuse only skips the host-side graph rebuild,
+// set_inputs still re-runs every step.
+bool decode_graph_reuse();
+
// Gather K, V and the kq_mask down to the current sequence's non-empty cells.
// No-op (returns immediately) unless active(). On return *k, *v and *kq_mask
// point at the compacted tensors; pass them straight to build_attn_mha.
+// `owner` is the attention input that owns the live (per-decode-refreshed) memory
+// context; the paged input reads owner->mctx in can_reuse so a reused graph picks
+// up the fresh context (see input_gather_idxs::can_reuse). May be null (no reuse).
void gather(ggml_context * ctx0,
llm_graph_result * res,
const llama_kv_cache_context * mctx,
+ const llm_graph_input_attn_kv * owner,
ggml_tensor ** k,
ggml_tensor ** v,
ggml_tensor ** kq_mask);
@@ -50,6 +63,7 @@ void gather(ggml_context * ctx0,
bool in_kernel_decode(ggml_context * ctx0,
llm_graph_result * res,
const llama_kv_cache_context * mctx,
+ const llm_graph_input_attn_kv * owner,
ggml_tensor ** k,
ggml_tensor ** v,
ggml_tensor ** kq_mask,
--
2.43.0

@@ -0,0 +1,95 @@
From ef2765d85829c9ede2fc9aa90523386d765c9040 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Sun, 28 Jun 2026 20:00:24 +0200
Subject: [PATCH 41/41] feat(paged): S3 decode-shape-stable scheduling (patch
0041)

The S1 paged decode-graph reuse (patch 0040) is necessary but not sufficient in
continuous serving: with cont_batching a co-batched prefill chunk inflates the
step from n_tokens==D (pure decode) to D+P, which changes the ubatch shape and
breaks llama-context layer-A reuse on (nearly) every step. Measured: S1 alone
holds only 13.8% graph reuse in a 128-client serving load.

S3 makes the scheduler EMIT graph-reusable steps to match what S1 makes reusable.
While there is live decode load it runs PURE-decode steps (skip Phase-2 prompt
admission) so the decode batch shape stays constant, and admits a prefill chunk
only on a bounded cadence (every LLAMA_PAGED_PREFILL_PERIOD steps, default 8) or
when no decode is active. The deferred prefill chunk still runs within at most
(period-1) decode steps, so prompt latency rises by a bounded amount.

Pure policy change inside update_slots(), built on the patch-0016 decode-first
budget; no new slot states, no batch-formation rewrite, zero libllama changes.

BIT-EXACT: only changes WHICH step a prompt chunk is admitted in. Each sequence's
decode logits depend on its own tokens + its own KV only (the paged decode read is
per-stream, attention is permutation-invariant over the co-batched set), so
deferring another slot's prefill never changes a generating slot's output.
DEFAULT-OFF: LLAMA_PAGED_DECODE_STABLE unset => byte-identical to patch 0016. Does
not run in the single-sequence greedy md5 gate (that path is llama-completion).

Measured (GB10, MoE Qwen3.6-35B-A3B-NVFP4, 128-client staggered streaming load):
S1+S3 vs baseline (graphs rebuilt every step): graph reuse 0% -> 72.2%, hostproc
15.98 -> 6.31 ms/step, decode 4.05 -> 5.52 tok/s/seq median (4.24 -> 5.96 mean,
at vLLM's ~5.9 sustained). Remaining 28% rebuilds are request-boundary D/seq-set
churn + the prefill-cadence steps; closing them needs a padded/fixed-slot decode
shape (scoped follow-up, see DECODE_SERVING_SCOPE.md).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
tools/server/server-context.cpp | 35 ++++++++++++++++++++++++++++++++-
1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
index 64775dc..9baca33 100644
--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
@@ -3138,11 +3138,44 @@ private:
}
int32_t n_prompt_budgeted = 0; // prompt tokens added to the batch this step (across slots)

+ // PAGED serving lever (patch 0041, S3): decode-shape-stable scheduling.
+ // Pairs with the S1 paged decode-graph reuse (patch 0040): S1 makes a
+ // pure-decode step graph-reusable, S3 makes the scheduler EMIT pure-decode
+ // steps. With continuous batching a co-batched prefill chunk inflates the
+ // step from n_tokens==D (pure decode) to D+P, which changes the ubatch
+ // shape and breaks layer-A graph reuse on EVERY step. S3 keeps prefill out
+ // of the decode step: while there is live decode load it runs pure-decode
+ // steps (reuse holds) and admits a prefill chunk only on a bounded cadence
+ // (every LLAMA_PAGED_PREFILL_PERIOD steps, default 8) or when no decode is
+ // active. The deferred prefill chunk still runs within a few steps, so
+ // prompt latency rises by at most (period-1) decode steps.
+ //
+ // BIT-EXACT: this only changes WHICH step a prompt chunk is admitted in.
+ // Each sequence's decode logits depend on its own tokens + its own KV only
+ // (the paged decode read is per-stream, attention is permutation-invariant
+ // over the co-batched set), so deferring another slot's prefill never
+ // changes a generating slot's output. DEFAULT-OFF: env unset => no change,
+ // byte-identical to patch 0016. Does not run in the single-sequence greedy
+ // md5 gate (that path is llama-completion, not update_slots).
+ bool decode_only_step = false;
+ {
+ static const int s3_enabled = [](){ const char * e = getenv("LLAMA_PAGED_DECODE_STABLE"); return e ? atoi(e) : 0; }();
+ if (s3_enabled && n_decode_in_batch > 0) {
+ static const int s3_period = [](){ const char * e = getenv("LLAMA_PAGED_PREFILL_PERIOD"); int p = e ? atoi(e) : 8; return p > 0 ? p : 8; }();
+ static long s3_step = 0;
+ const bool prefill_due = (s3_step % s3_period) == 0;
+ s3_step++;
+ decode_only_step = !prefill_due;
+ }
+ }
+
auto & alora_scale = batch.alora_scale;
auto & alora_disabled_id = batch.alora_disabled_id;

// next, batch any pending prompts without exceeding n_batch
- if (params_base.cont_batching || batch.size() == 0) {
+ // (patch 0041, S3) skip prompt admission on a pure-decode step to keep the
+ // decode batch shape reuse-stable
+ if ((params_base.cont_batching || batch.size() == 0) && !decode_only_step) {
bool add_ok = true; // false means the batch is full, skip remaining slots

iterate(slots, [&](server_slot & slot) {
--
2.43.0
