diff --git a/backend/cpp/llama-cpp-localai-paged/README.md b/backend/cpp/llama-cpp-localai-paged/README.md index 6804b6605..e14a0837c 100644 --- a/backend/cpp/llama-cpp-localai-paged/README.md +++ b/backend/cpp/llama-cpp-localai-paged/README.md @@ -86,7 +86,7 @@ orthogonal to the paged allocator. --- -## 3. Patch series (0001-0041) +## 3. Patch series (0001-0043) Source-only patches, with intentional numbering gaps (e.g. 0005, 0027). The decode-serving graph-reuse levers are 0040-0041. "Bit-exact" = greedy md5 / @@ -135,14 +135,19 @@ hides. | # | What it does | Bit-exact | |---|---|---| | 0040 | **S1 paged decode-graph reuse** - the paged decode inputs (`input_block_table` / `input_gather_idxs`) never overrode `can_reuse` (defaults to false), so any graph carrying a paged input could never be reused. Add a correct `can_reuse` keyed on the (256-bucketed) block-table dims + a live-mctx refresh from the owning attn input. `LLAMA_PAGED_NO_GRAPH_REUSE=1` forces the pre-S1 path. | yes (md5 byte-identical reuse on/off; dense `5951a5b4`, paged-MoE `8cb0ce23`) | -| 0041 | **S3 decode-shape-stable scheduling** - keep co-batched prefill OUT of decode steps so the pure-decode batch shape stays reuse-stable (S1 makes a pure-decode step reusable; S3 makes the scheduler emit them). Pure `update_slots()` policy on top of 0016; prefill admitted on a bounded cadence (`LLAMA_PAGED_PREFILL_PERIOD`, default 8). **Default ON under paged KV** (enabled when `LLAMA_KV_PAGED` is set; `LLAMA_PAGED_DECODE_STABLE=0` forces it off). | yes (byte-identical on/off; per-stream independent in serving) | +| 0041 | **S3 decode-shape-stable scheduling** - keep co-batched prefill OUT of decode steps so the pure-decode batch shape stays reuse-stable (S1 makes a pure-decode step reusable; S3 makes the scheduler emit them). Pure `update_slots()` policy on top of 0016; prefill admitted on a bounded cadence (`LLAMA_PAGED_PREFILL_PERIOD`, default 8). **Default OFF** (opt-in via `LLAMA_PAGED_DECODE_STABLE=1`): a measured end-to-end A/B proved default-on is a serving mistake - deferring prefill admission on the period-8 cadence gives **2.5x worse TTFT** (60s vs 24s at N=256) and **20-29% lower end-to-end throughput**, with no end-to-end win at any concurrency; its apparent `decode_agg` gain was a metric artifact (faster per-step decode bought by starving prefill). Default prefers prompt prefill admission for good TTFT; opt in only for decode-dominated, low-arrival traffic where TTFT is not a concern. | yes (byte-identical on/off; per-stream independent in serving) | Measured (GB10, MoE Qwen3.6-35B-A3B-NVFP4, 128-client staggered streaming load): graph reuse **0% -> 72.2%**, host window `hostproc` **15.98 -> 6.31 ms/step**, decode **4.05 -> 5.52 tok/s/seq median (4.24 -> 5.96 mean, at vLLM's ~5.9 sustained)**. S1 is necessary but **not** sufficient alone (13.8% reuse - prefill -co-batching churns the shape nearly every step); S3 is the multiplier, so they -ship and are measured together. The static batched-bench A/B isolates the S1 +co-batching churns the shape nearly every step); S3 is the multiplier of that +per-step decode metric. **But those are per-step decode numbers, not an end-to-end +serving win**: a later end-to-end A/B showed S3-default-on regresses real serving +(2.5x worse TTFT, 20-29% lower end-to-end throughput, no win at any concurrency), +because the period-8 cadence defers prefill admission. So **only S1 (0040) ships +default-on; S3 (0041) now defaults OFF and is opt-in** (`LLAMA_PAGED_DECODE_STABLE=1`, +for decode-dominated low-arrival traffic). The static batched-bench A/B isolates the S1 mechanism: paged decode reuse 0% -> 95.5% (throughput flat there, since the static regime is GPU-bound). **S2 (double-buffer `set_inputs`) was dropped**: the Phase-0 profile put `set_inputs` at ~0.05 ms/step (the cost is the rebuild, not the input @@ -168,12 +173,13 @@ These are the dominant decode levers on the Qwen3.6 hybrid models. All bit-exact | 0022 | **GDN recurrence occupancy/coalescing retune** - column-folding (NUM_WARPS/COLS_PER_WARP) raises memory-level parallelism on the bandwidth-bound B=128 recurrence kernel; per-column f32 FMA order unchanged. 73.4%->84.6% of GB10 peak BW. | dense +11.1% / MoE +8.3% | | 0028 | **Recurrent conv-tap gather fusion** - the last `k_get_rows` in the GDN decode path (the conv-state tap gather) becomes an indexed in-kernel read. | dense ~377 t/s / MoE ~784 t/s | -### MoE NVFP4 quant (0023, 0025) +### MoE NVFP4 quant (0023, 0025, 0043) | # | What it does | Bit-exact | |---|---|---| | 0023 | **NVFP4 activation-quantize de-dup** - the broadcast up/gate projections re-quantize the same token activation once per expert; quantize the unique token activations once and byte-copy them into the expert-gathered layout. The only NVFP4-specific patch. | yes (byte-identical) | -| 0025 | **MoE decode re-graph** - keep CUDA graphs on for the grouped-MMQ MoE decode step (the upstream guard disables graphs conservatively; the grouped path has no host sync). Env-gated `LLAMA_MOE_FORCE_GRAPHS`. | yes (graph replay re-issues identical kernels) | +| 0025 | **MoE decode re-graph** - keep CUDA graphs on for the grouped-MMQ MoE decode step (the upstream guard disables graphs conservatively; the grouped path has no host sync). Was env-gated `LLAMA_MOE_FORCE_GRAPHS`; now ON by default via 0043. | yes (graph replay re-issues identical kernels) | +| 0043 | **MoE decode graph default-on (D1)** - flip 0025 to ON by default: capture/replay the full-step decode CUDA graph (incl. the grouped-MMQ MoE dispatch) instead of re-issuing every kernel each step. Guard is `should_use_mmq()` (FALSE for the large-M NVFP4 prefill of 0034, so prefill keeps graphs disabled - its per-expert host-loop genuinely syncs). `LLAMA_MOE_NO_FORCE_GRAPHS=1` forces the conservative pre-0025 disable for A/B. D1 profiling: the per-expert host-loop (the only device->host MoE-routing readback) is never hit on the NVFP4 grouped path (sync count identical graphs on/off); steady decode is ~99% GPU-busy, so the cost removed is per-step host kernel RE-ISSUE, not a sync. | yes (md5 byte-identical default/off/forced; paged-MoE `8cb0ce23`, dense `5951a5b4`) | ### Pool reclaim, block-table cache, backend gate @@ -338,6 +344,21 @@ llama is losing. The MoE GEMM kernel is *not* where the gap lives. - **Lever 2 - graph/stream coverage: FLAT.** Bit-exact graph coverage was exhausted by 0025; more graph/stream overlap is a no-op or small regression on this model. +- **D1 premise "static decode is host-sync-bound on the MoE-routing readback": + REFUTED.** The hypothesis was that the dominant decode cost is the device->host + readback of MoE routing before launching the per-expert GEMMs (mul_mat_id's + per-expert host-loop fallback). Profiling (GB10, q36-35b-a3b-nvfp4, batched-bench + npl128) shows the opposite: on NVFP4 the grouped stream-k MMQ id-path is what + runs (routing stays device-side), so the host-loop fallback is **never hit** - + `cudaStreamSynchronize` count is *identical* with CUDA graphs on vs off (1457 + either way; only the kernel-launch count changes, ~100k vs ~229k). Steady-decode + GPU-busy is **~99%** (1% idle), i.e. static decode is GPU-bound, not idle waiting + on a sync. The one actionable residual the profile surfaced - per-step host + kernel **re-issue** when the step is not graph-captured - shipped as 0043 + (default-on full-step decode graph), worth +2.6% (npl128) to +5-13% (npl32). The + larger continuous-serving host cost is the graph **rebuild** (0040/0041), and the + irreducible floor is the per-step logits-D2H-before-sampling serial point - none + of which is the MoE-routing readback. - **Lever 3 - act-quant fusion: FLAT.** The W4A4 act-quant tax is removable only by W4A16 (a precision change, rejected) or a structural kernel rewrite; no further bit-exact lever clears it. 0023 already banks the de-dup. diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0041-feat-paged-S3-decode-shape-stable-scheduling-patch-0.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0041-feat-paged-S3-decode-shape-stable-scheduling-patch-0.patch index 39867f1fa..801c2c574 100644 --- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0041-feat-paged-S3-decode-shape-stable-scheduling-patch-0.patch +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0041-feat-paged-S3-decode-shape-stable-scheduling-patch-0.patch @@ -1,4 +1,4 @@ -From ee8021b56ed0effe493a64aa50449ab928dd6b29 Mon Sep 17 00:00:00 2001 +From ddff2279f23f18cadfbbb907a397d66b3609e9cd Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Sun, 28 Jun 2026 20:00:24 +0200 Subject: [PATCH 41/41] feat(paged): S3 decode-shape-stable scheduling (patch @@ -23,30 +23,36 @@ budget; no new slot states, no batch-formation rewrite, zero libllama changes. BIT-EXACT: only changes WHICH step a prompt chunk is admitted in. Each sequence's decode logits depend on its own tokens + its own KV only (the paged decode read is per-stream, attention is permutation-invariant over the co-batched set), so -deferring another slot's prefill never changes a generating slot's output. -DEFAULT-ON under paged KV: with LLAMA_KV_PAGED set this enables by default -(LLAMA_PAGED_DECODE_STABLE=0 forces it off); otherwise unset => byte-identical to -patch 0016. Does +deferring another slot's prefill never changes a generating slot's output. Does not run in the single-sequence greedy md5 gate (that path is llama-completion). +DEFAULT-OFF (A/B finding): a measured end-to-end A/B proved that making S3 +default-on under paged KV is a serving mistake. Deferring prefill admission on the +period-8 cadence defers prompt admission: 2.5x worse TTFT (60s vs 24s at N=256) +and 20-29% lower end-to-end throughput, with no end-to-end win at any concurrency. +Its apparent decode_agg gain was a metric artifact (faster per-step decode bought +by starving prefill). So S3 now defaults OFF (prefer prompt prefill admission for +good TTFT) and is opt-in via LLAMA_PAGED_DECODE_STABLE=1, intended only for +decode-dominated, low-arrival traffic where TTFT is not a concern. With +LLAMA_PAGED_DECODE_STABLE unset => byte-identical to patch 0016. + Measured (GB10, MoE Qwen3.6-35B-A3B-NVFP4, 128-client staggered streaming load): S1+S3 vs baseline (graphs rebuilt every step): graph reuse 0% -> 72.2%, hostproc 15.98 -> 6.31 ms/step, decode 4.05 -> 5.52 tok/s/seq median (4.24 -> 5.96 mean, -at vLLM's ~5.9 sustained). Remaining 28% rebuilds are request-boundary D/seq-set -churn + the prefill-cadence steps; closing them needs a padded/fixed-slot decode -shape (scoped follow-up, see DECODE_SERVING_SCOPE.md). +at vLLM's ~5.9 sustained). NOTE these are per-step decode metrics; the A/B above +shows they do not translate to an end-to-end serving win, hence default-off. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto --- - tools/server/server-context.cpp | 40 ++++++++++++++++++++++++++++++++- - 1 file changed, 39 insertions(+), 1 deletion(-) + tools/server/server-context.cpp | 46 ++++++++++++++++++++++++++++++++- + 1 file changed, 45 insertions(+), 1 deletion(-) diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp -index 64775dc..fc0231a 100644 +index 64775dc..a77e267 100644 --- a/tools/server/server-context.cpp +++ b/tools/server/server-context.cpp -@@ -3138,11 +3138,49 @@ private: +@@ -3138,11 +3138,55 @@ private: } int32_t n_prompt_budgeted = 0; // prompt tokens added to the batch this step (across slots) @@ -66,16 +72,22 @@ index 64775dc..fc0231a 100644 + // Each sequence's decode logits depend on its own tokens + its own KV only + // (the paged decode read is per-stream, attention is permutation-invariant + // over the co-batched set), so deferring another slot's prefill never -+ // changes a generating slot's output. DEFAULT-ON under paged KV: with -+ // LLAMA_KV_PAGED set it enables by default (LLAMA_PAGED_DECODE_STABLE=0 -+ // forces off); otherwise byte-identical to patch 0016. Does not run in the -+ // single-sequence greedy md5 gate (that path is llama-completion, not update_slots). ++ // changes a generating slot's output. Does not run in the single-sequence ++ // greedy md5 gate (that path is llama-completion, not update_slots). ++ // ++ // DEFAULT-OFF (A/B finding): an end-to-end A/B proved S3-on is a serving ++ // mistake. Deferring prefill admission on the period-8 cadence delays prompt ++ // admission: 2.5x worse TTFT (60s vs 24s at N=256) and 20-29% lower end-to-end ++ // throughput, with no end-to-end win at any concurrency. Its apparent ++ // decode_agg gain was a metric artifact (faster per-step decode bought by ++ // starving prefill). So the default prefers prompt prefill admission for good ++ // TTFT; S3 is opt-in (LLAMA_PAGED_DECODE_STABLE=1) only for decode-dominated, ++ // low-arrival traffic where TTFT is not a concern. + bool decode_only_step = false; + { + static const int s3_enabled = [](){ + const char * e = getenv("LLAMA_PAGED_DECODE_STABLE"); -+ if (e) { return atoi(e); } // explicit override (=0 forces off) -+ return getenv("LLAMA_KV_PAGED") != nullptr ? 1 : 0; // default ON under paged KV ++ return e ? atoi(e) : 0; // default OFF; opt-in via LLAMA_PAGED_DECODE_STABLE=1 + }(); + if (s3_enabled && n_decode_in_batch > 0) { + static const int s3_period = [](){ const char * e = getenv("LLAMA_PAGED_PREFILL_PERIOD"); int p = e ? atoi(e) : 8; return p > 0 ? p : 8; }();