docs(paged): record padded/fixed-slot decode shape as tested-and-rejected

The S1 section-(a) padded/fixed-slot decode shape (the scoped follow-up to push serving graph reuse from ~72% toward ~100%) was implemented in an isolated worktree off the committed S1/S3/tail base, built CUDA-only, and benched on GB10. Verdict: REJECTED. It is bit-exact and provably inert, but it regresses serving throughput at every concurrency and does not close the vLLM gap.

Implementation (default-off, LLAMA_PAGED_PAD_DECODE): on a pure-decode step (n_prompt_budgeted == 0) emit a masked-inert dummy decode for every idle slot so n_tokens / n_seqs / n_seqs_unq / n_outputs and the seq-id set stay constant; a release()-side guard keeps a finished slot warm under padding. Each dummy is its own sequence (private recurrent state, per-stream paged attention, logits discarded), so it cannot perturb a real stream.

Gates: single-seq greedy md5 bit-exact (dense 5951a5b4, paged-MoE 8cb0ce23). The literal per-stream ON-vs-OFF identity gate is unachievable - concurrent cuBLAS/FA decode is not bit-reproducible run-to-run even with padding off (OFF-vs-OFF diverging streams: dense 3/16, MoE 8/16). The achievable inertness gate passed: ON-vs-OFF per-stream prefix-agreement equals the OFF-vs-OFF noise floor exactly (MoE 0.940/0.940, dense 0.812/0.812), so the dummy slots leak nothing.

Bench (MoE Qwen3.6-35B-A3B-NVFP4, GB10), burst decode tok/s/seq: n=8 S1+S3 28.16 / PAD 6.05 / vLLM 44.8; n=128 S1+S3 4.53 / PAD 4.32 / vLLM 6.87. Staggered aggregate tok/s: baseline (reuse 0%) 757.6, S1+S3 (reuse 72%) 763.3, PAD (reuse 38%) 558.0.

Why it fails:
(1) serving decode here is GPU-compute-bound, not host-rebuild-bound - baseline reuse 0% ~= S1+S3 reuse 72% on aggregate tok/s, so closing reuse buys ~nothing (the earlier 542->762 host-bound delta did not reproduce);
(2) padding adds dummy-row compute proportional to pad_width - real_load, catastrophic at low load;
(3) in continuous serving padding cannot hold a constant width (perpetual prefill churn) so reuse drops 72% -> 38%;
(4) the completion-driven batch shrink padding prevents is itself a throughput win in a compute-bound regime.

The residual burst gap is GPU-compute, which a host-side reuse lever cannot close. Patch series unchanged: this rejected lever is NOT added to patches/paged/.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-30 03:17:01 -04:00 · 2026-06-28 20:47:43 +00:00
parent 2fa8ef8fc5
commit b028c81eda
2 changed files with 95 additions and 10 deletions
--- a/backend/cpp/llama-cpp-localai-paged/README.md
+++ b/backend/cpp/llama-cpp-localai-paged/README.md
@@ -147,8 +147,13 @@ mechanism: paged decode reuse 0% -> 95.5% (throughput flat there, since the stat
 regime is GPU-bound). **S2 (double-buffer `set_inputs`) was dropped**: the Phase-0
 profile put `set_inputs` at ~0.05 ms/step (the cost is the rebuild, not the input
 copy), so it has nothing to recover. The remaining ~28% serving rebuilds are
-request-boundary D/seq-set churn + the prefill-cadence steps; a padded/fixed-slot
-decode shape to capture them is scoped in `docs/DECODE_SERVING_SCOPE.md`.
+request-boundary D/seq-set churn + the prefill-cadence steps. A **padded/fixed-slot
+decode shape** to capture them was then implemented and GPU-tested (2026-06-28) and
+**REJECTED** - it is bit-exact/inert but regresses serving throughput at every
+concurrency, because this serving decode is GPU-compute-bound (baseline reuse 0% ~=
+S1+S3 reuse 72% on aggregate tok/s), so the dummy-row compute it adds costs more
+than the reuse it recovers. Full record + numbers in `docs/DECODE_SERVING_SCOPE.md`
+("Padded-shape lever - rejected").

 ### SSM (gated-DeltaNet) decode levers (0018-0022, 0028)

--- a/backend/cpp/llama-cpp-localai-paged/docs/DECODE_SERVING_SCOPE.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/DECODE_SERVING_SCOPE.md
@@ -22,14 +22,94 @@ graph rebuild; `set_inputs` 0.047 ms and block-table 0.002 ms are negligible.
  decode 4.05 -> 5.52 tok/s/seq median (4.24 -> 5.96 mean, at vLLM's ~5.9).**
 - **S2 (double-buffer set_inputs) - DROPPED.** Phase 0 put `set_inputs` at
  ~0.05 ms/step: it is not the cost (the rebuild is), so S2 has nothing to recover.
- **Follow-up to ~100% reuse:** the remaining ~28% serving rebuilds are
-  request-boundary D/seq-set churn + the S3 prefill-cadence steps. Capturing them
-  needs a **padded/fixed-slot decode shape** (pad the decode width to a fixed
-  bucket with masked-inert dummy slots so `n_tokens` and the seq-id set stay
-  constant across arrivals/completions - the lever S1 section (a) describes).
-  Deferred: S1+S3 already reach vLLM-parity on the mean; padding is server-side,
-  invasive, and not exercised by the single-sequence md5 gate (needs a per-stream
-  serving-determinism gate). It is the next lever, not a shipped one.
+- **Follow-up to ~100% reuse - PADDED/FIXED-SLOT DECODE SHAPE: IMPLEMENTED,
+  GPU-TESTED, REJECTED (not shipped).** See the "Padded-shape lever - rejected"
+  block below. Summary: it does NOT close the serving gap. Padding holds the
+  pure-decode width constant by emitting masked-inert dummy decodes for idle
+  slots, and it is provably inert (single-seq md5 bit-exact + per-stream
+  noise-floor determinism), but it **regresses throughput at every concurrency**
+  (catastrophically at low load) because the serving decode here is
+  **GPU-compute-bound, not host-rebuild-bound** - so the dummy-row compute it adds
+  costs more than the graph-reuse it recovers. The original "remaining ~28% is
+  request-boundary churn -> pad it" hypothesis stands mechanically, but the payoff
+  premise (closing reuse pulls decode toward vLLM) is **not supported by
+  measurement**.
+
+---
+
+## Padded-shape lever - rejected (implemented + GPU-tested, 2026-06-28)
+
+The S1 section-(a) **padded / fixed-slot decode shape** was implemented in an
+isolated worktree off the committed S1/S3/tail base (paged HEAD `05eceb4`), built
+CUDA-only, and benched on GB10. **Verdict: REJECTED - it regresses serving
+throughput and does not close the vLLM gap.** Recorded here so it is not re-tried.
+
+**Implementation** (default-off, `LLAMA_PAGED_PAD_DECODE=1`; `LLAMA_PAGED_PAD_WIDTH`
+caps the slot range): at the end of `pre_decode()`, on any step where no prompt
+tokens were admitted (`n_prompt_budgeted == 0`) and there is decode load, emit a
+masked-inert dummy decode for **every IDLE slot** (`batch.add(slot.id, 0,
+pos_max+1, /*output=*/true)`; cold slot -> fresh pos-0). This holds `n_tokens`,
+`n_seqs`, `n_seqs_unq`, `n_outputs` and the participating seq-id SET constant
+across arrivals/completions. A `release()`-side guard keeps a finished slot warm
+under padding (else patch 0024's reclaim-on-idle frees its KV and the next-step
+pos-0 re-warm churns paged-block allocation, destroying reuse). Each dummy is its
+OWN sequence, so its recurrent (gated-DeltaNet) state is private and its paged
+attention reads only its own cells; its logits are computed but never read
+(`post_decode()` only consumes `slot.i_batch` of GENERATING slots).
+
+**Gates.** (1) Single-seq greedy md5 **bit-exact PASS** - dense
+`5951a5b4d624ce891e22ab5fca9bc439`, paged-MoE `8cb0ce23777bf55f92f63d0292c756b0`
+(the lever lives only in `llama-server`'s `update_slots()`, never in
+`llama-completion`). (2) **Per-stream serving determinism**: the literal
+"ON-vs-OFF token sequences identical" gate is **unachievable** - concurrent
+cuBLAS/FA decode is **not bit-reproducible run-to-run** even with padding OFF
+(OFF-vs-OFF diverging streams: dense 3/16, MoE 8/16, lockstep K=16). The
+**achievable inertness gate PASSED**: per-stream prefix-agreement ON-vs-OFF equals
+the OFF-vs-OFF noise floor exactly (MoE 0.940/0.940, dense 0.812/0.812), i.e. the
+dummy slots inject no systematic divergence beyond the pre-existing concurrent FP
+noise. So padding is provably inert; it just does not help.
+
+**Bench (MoE Qwen3.6-35B-A3B-NVFP4, GB10).** Burst h2h, decode tok/s/seq:
+
+| n   | S1+S3 | PAD  | vLLM  |
+|-----|-------|------|-------|
+| 8   | 28.16 | 6.05 | 44.8  |
+| 32  | 11.66 | 4.84 | 17.45 |
+| 64  | 7.16  | 4.33 | 11.07 |
+| 128 | 4.53  | 4.32 | 6.87  |
+
+Staggered (`serve_bench.py` k=128 n=160 stagger0.25), aggregate decode tok/s and
+graph-reuse: baseline (reuse 0%) **757.6**; S1+S3 (reuse 72%) **763.3**; **PAD
+(reuse 38%) 558.0**.
+
+**Why it fails (four independent reasons):**
+
+1. **Serving decode is GPU-compute-bound, not host-rebuild-bound (this run).**
+   Baseline reuse 0% (757.6 agg) is within 1% of S1+S3 reuse 72% (763.3
+   agg): `hostproc` is only ~4-8% of the per-step wall, so eliminating the host
+   graph rebuild buys ~nothing. (This **corrects the host-bound hypothesis** above
+   for this hardware: the earlier 542->762 host-bound delta did **not** reproduce;
+   it was GPU-state/contention variance, not a stable reuse effect.)
+2. **Padding ADDS dummy-row compute** (full-width decode), costing throughput in
+   direct proportion to `pad_width - real_load`: catastrophic at low concurrency
+   (n=8: 28.16 -> 6.05, ~4.6x slower, because 8 real streams pay for a 128-wide
+   step).
+3. **In continuous serving padding can't even hold the width constant**: arrivals
+   are perpetually mid-prefill, so the idle-slot count varies and reuse DROPS
+   72% -> 38% (the opposite of the goal). It only stabilises the pure-decode
+   *tail* of a burst (verified: width pinned at 64 as real decoders fell 49->5),
+   which is exactly where the dummy compute is most wasteful.
+4. **The completion-driven batch shrink that padding prevents is itself a
+   throughput WIN** in a compute-bound regime (fewer real streams -> cheaper
+   steps -> survivors finish faster); forcing constant width forfeits it.
+
+**Conclusion.** The residual burst gap (paged 4.53 vs vLLM 6.87 at n=128 ~= 66%)
+is a **GPU-compute** gap (vLLM's MoE decode kernel + scheduler are ~1.3x faster on
+aggregate), not a host-loop gap. A host-side graph-reuse lever cannot close it.
+Do not re-pursue padded/fixed-slot shapes for throughput; if the host loop is ever
+re-confirmed dominant on other hardware (re-run reason 1's baseline-vs-S1+S3 A/B
+first), revisit - but only with an *adaptive* width matched to live load, never a
+fixed pad-to-`--parallel`.

 ---
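
The ratios quoted in the rejected-lever record (the ~4.6x slowdown at n=8, the ~66% share of vLLM at n=128, and the near-identical baseline vs S1+S3 aggregate) can be re-derived from the published numbers with a short script. This is a minimal sketch, not part of the patch series: the dictionary keys, the `summarize` helper, and `PAD_WIDTH = 128` are illustrative assumptions (the record only states a 128-wide step for the n=8 case), while the throughput figures are copied verbatim from the tables above.

```python
# Burst decode tok/s/seq, copied from the bench table in the record.
BURST = {
    8: {"s1s3": 28.16, "pad": 6.05, "vllm": 44.8},
    32: {"s1s3": 11.66, "pad": 4.84, "vllm": 17.45},
    64: {"s1s3": 7.16, "pad": 4.33, "vllm": 11.07},
    128: {"s1s3": 4.53, "pad": 4.32, "vllm": 6.87},
}

# Staggered aggregate decode tok/s, copied from the record.
STAGGERED = {"baseline": 757.6, "s1s3": 763.3, "pad": 558.0}

# Assumption: every burst row was padded to the same 128-wide step.
# The record states this explicitly only for n=8.
PAD_WIDTH = 128


def summarize(n, row, pad_width=PAD_WIDTH):
    """Derive the headline ratios for one burst-table row."""
    return {
        "n": n,
        # Dummy rows computed per step at full load (pad_width - real_load).
        "dummy_rows": max(pad_width - n, 0),
        # How many times slower PAD is than S1+S3 per sequence.
        "pad_slowdown": row["s1s3"] / row["pad"],
        # S1+S3 per-sequence throughput as a fraction of vLLM's.
        "share_of_vllm": row["s1s3"] / row["vllm"],
    }


if __name__ == "__main__":
    for n, row in BURST.items():
        s = summarize(n, row)
        print(
            f"n={s['n']:>3}  dummy_rows={s['dummy_rows']:>3}  "
            f"PAD slowdown={s['pad_slowdown']:.2f}x  "
            f"S1+S3/vLLM={s['share_of_vllm']:.0%}"
        )
    # Reuse 0% -> 72% changes aggregate throughput by well under 1%.
    gain = STAGGERED["s1s3"] / STAGGERED["baseline"] - 1.0
    # PAD keeps roughly 73% of the S1+S3 aggregate.
    kept = STAGGERED["pad"] / STAGGERED["s1s3"]
    print(f"S1+S3 vs baseline: {gain:+.2%}; PAD vs S1+S3: {kept:.0%}")
```

Under the fixed-width assumption, the slowdown shrinks as `dummy_rows` shrinks, which is the "cost proportional to `pad_width - real_load`" pattern that reason 2 describes.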