docs(paged): record padded/fixed-slot decode shape as tested-and-rejected

The S1 section-(a) padded/fixed-slot decode shape (the scoped follow-up to push
serving graph reuse from ~72% toward ~100%) was implemented in an isolated
worktree off the committed S1/S3/tail base, built CUDA-only, and benched on GB10.
Verdict: REJECTED. It is bit-exact and provably inert, but it regresses serving
throughput at every concurrency and does not close the vLLM gap.

Implementation (default-off, LLAMA_PAGED_PAD_DECODE): on a pure-decode step
(n_prompt_budgeted == 0) emit a masked-inert dummy decode for every idle slot so
n_tokens / n_seqs / n_seqs_unq / n_outputs and the seq-id set stay constant; a
release()-side guard keeps a finished slot warm under padding. Each dummy is its
own sequence (private recurrent state, per-stream paged attention, logits
discarded), so it cannot perturb a real stream.

Gates: single-seq greedy md5 bit-exact (dense 5951a5b4, paged-MoE 8cb0ce23). The
literal per-stream ON-vs-OFF identity gate is unachievable - concurrent cuBLAS/FA
decode is not bit-reproducible run-to-run even with padding off (OFF-vs-OFF
diverging streams: dense 3/16, MoE 8/16). The achievable inertness gate passed:
ON-vs-OFF per-stream prefix-agreement equals the OFF-vs-OFF noise floor exactly
(MoE 0.940/0.940, dense 0.812/0.812), so the dummy slots leak nothing.

Bench (MoE Qwen3.6-35B-A3B-NVFP4, GB10), burst decode tok/s/seq: n=8 S1+S3 28.16
/ PAD 6.05 / vLLM 44.8; n=128 S1+S3 4.53 / PAD 4.32 / vLLM 6.87. Staggered
aggregate tok/s: baseline (reuse 0%) 757.6, S1+S3 (reuse 72%) 763.3, PAD
(reuse 38%) 558.0.

Why it fails: (1) serving decode here is GPU-compute-bound, not host-rebuild-bound
- baseline reuse 0% ~= S1+S3 reuse 72% on aggregate tok/s, so closing reuse buys
~nothing (the earlier 542->762 host-bound delta did not reproduce); (2) padding
adds dummy-row compute proportional to pad_width - real_load, catastrophic at low
load; (3) in continuous serving padding cannot hold a constant width (perpetual
prefill churn) so reuse drops 72% -> 38%; (4) the completion-driven batch shrink
padding prevents is itself a throughput win in a compute-bound regime. The
residual burst gap is GPU-compute, which a host-side reuse lever cannot close.

Patch series unchanged: this rejected lever is NOT added to patches/paged/.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
Ettore Di Giacinto
2026-06-28 20:47:43 +00:00
parent 2fa8ef8fc5
commit b028c81eda
2 changed files with 95 additions and 10 deletions

View File

@@ -147,8 +147,13 @@ mechanism: paged decode reuse 0% -> 95.5% (throughput flat there, since the stat
regime is GPU-bound). **S2 (double-buffer `set_inputs`) was dropped**: the Phase-0
profile put `set_inputs` at ~0.05 ms/step (the cost is the rebuild, not the input
copy), so it has nothing to recover. The remaining ~28% serving rebuilds are
request-boundary D/seq-set churn + the prefill-cadence steps; a padded/fixed-slot
decode shape to capture them is scoped in `docs/DECODE_SERVING_SCOPE.md`.
request-boundary D/seq-set churn + the prefill-cadence steps. A **padded/fixed-slot
decode shape** to capture them was then implemented and GPU-tested (2026-06-28) and
**REJECTED** - it is bit-exact/inert but regresses serving throughput at every
concurrency, because this serving decode is GPU-compute-bound (baseline reuse 0% ~=
S1+S3 reuse 72% on aggregate tok/s), so the dummy-row compute it adds costs more
than the reuse it recovers. Full record + numbers in `docs/DECODE_SERVING_SCOPE.md`
("Padded-shape lever - rejected").
### SSM (gated-DeltaNet) decode levers (0018-0022, 0028)

View File

@@ -22,14 +22,94 @@ graph rebuild; `set_inputs` 0.047 ms and block-table 0.002 ms are negligible.
decode 4.05 -> 5.52 tok/s/seq median (4.24 -> 5.96 mean, at vLLM's ~5.9).**
- **S2 (double-buffer set_inputs) - DROPPED.** Phase 0 put `set_inputs` at
~0.05 ms/step: it is not the cost (the rebuild is), so S2 has nothing to recover.
- **Follow-up to ~100% reuse:** the remaining ~28% serving rebuilds are
request-boundary D/seq-set churn + the S3 prefill-cadence steps. Capturing them
needs a **padded/fixed-slot decode shape** (pad the decode width to a fixed
bucket with masked-inert dummy slots so `n_tokens` and the seq-id set stay
constant across arrivals/completions - the lever S1 section (a) describes).
Deferred: S1+S3 already reach vLLM-parity on the mean; padding is server-side,
invasive, and not exercised by the single-sequence md5 gate (needs a per-stream
serving-determinism gate). It is the next lever, not a shipped one.
- **Follow-up to ~100% reuse - PADDED/FIXED-SLOT DECODE SHAPE: IMPLEMENTED,
GPU-TESTED, REJECTED (not shipped).** See the "Padded-shape lever - rejected"
block below. Summary: it does NOT close the serving gap. Padding holds the
pure-decode width constant by emitting masked-inert dummy decodes for idle
slots, and it is provably inert (single-seq md5 bit-exact + per-stream
noise-floor determinism), but it **regresses throughput at every concurrency**
(catastrophically at low load) because the serving decode here is
**GPU-compute-bound, not host-rebuild-bound** - so the dummy-row compute it adds
costs more than the graph-reuse it recovers. The original "remaining ~28% is
request-boundary churn -> pad it" hypothesis stands mechanically, but the payoff
premise (closing reuse pulls decode toward vLLM) is **not supported by
measurement**.
---
## Padded-shape lever - rejected (implemented + GPU-tested, 2026-06-28)
The S1 section-(a) **padded / fixed-slot decode shape** was implemented in an
isolated worktree off the committed S1/S3/tail base (paged HEAD `05eceb4`), built
CUDA-only, and benched on GB10. **Verdict: REJECTED - it regresses serving
throughput and does not close the vLLM gap.** Recorded here so it is not re-tried.
**Implementation** (default-off, `LLAMA_PAGED_PAD_DECODE=1`; `LLAMA_PAGED_PAD_WIDTH`
caps the slot range): at the end of `pre_decode()`, on any step where no prompt
tokens were admitted (`n_prompt_budgeted == 0`) and there is decode load, emit a
masked-inert dummy decode for **every IDLE slot** (`batch.add(slot.id, 0,
pos_max+1, /*output=*/true)`; cold slot -> fresh pos-0). This holds `n_tokens`,
`n_seqs`, `n_seqs_unq`, `n_outputs` and the participating seq-id SET constant
across arrivals/completions. A `release()`-side guard keeps a finished slot warm
under padding (else patch 0024's reclaim-on-idle frees its KV and the next-step
pos-0 re-warm churns paged-block allocation, destroying reuse). Each dummy is its
OWN sequence, so its recurrent (gated-DeltaNet) state is private and its paged
attention reads only its own cells; its logits are computed but never read
(`post_decode()` only consumes `slot.i_batch` of GENERATING slots).
**Gates.** (1) Single-seq greedy md5 **bit-exact PASS** - dense
`5951a5b4d624ce891e22ab5fca9bc439`, paged-MoE `8cb0ce23777bf55f92f63d0292c756b0`
(the lever lives only in `llama-server`'s `update_slots()`, never in
`llama-completion`). (2) **Per-stream serving determinism**: the literal
"ON-vs-OFF token sequences identical" gate is **unachievable** - concurrent
cuBLAS/FA decode is **not bit-reproducible run-to-run** even with padding OFF
(OFF-vs-OFF diverging streams: dense 3/16, MoE 8/16, lockstep K=16). The
**achievable inertness gate PASSED**: per-stream prefix-agreement ON-vs-OFF equals
the OFF-vs-OFF noise floor exactly (MoE 0.940/0.940, dense 0.812/0.812), i.e. the
dummy slots inject no systematic divergence beyond the pre-existing concurrent FP
noise. So padding is provably inert; it just does not help.
**Bench (MoE Qwen3.6-35B-A3B-NVFP4, GB10).** Burst h2h, decode tok/s/seq:
| n | S1+S3 | PAD | vLLM |
|-----|-------|------|------|
| 8 | 28.16 | 6.05 | 44.8 |
| 32 | 11.66 | 4.84 | 17.45|
| 64 | 7.16 | 4.33 | 11.07|
| 128 | 4.53 | 4.32 | 6.87 |
Staggered (`serve_bench.py` k=128 n=160 stagger0.25), aggregate decode tok/s and
graph-reuse: baseline (reuse 0%) **757.6**; S1+S3 (reuse 72%) **763.3**; **PAD
(reuse 38%) 558.0**.
**Why it fails (four independent reasons):**
1. **Serving decode is GPU-compute-bound, not host-rebuild-bound (this run).**
Baseline reuse 0% (757.6 agg) is statistically equal to S1+S3 reuse 72% (763.3
agg): `hostproc` is only ~4-8% of the per-step wall, so eliminating the host
graph rebuild buys ~nothing. (This **corrects the host-bound hypothesis** above
for this hardware: the earlier 542->762 host-bound delta did **not** reproduce
- it was GPU-state/contention variance, not a stable reuse effect.)
2. **Padding ADDS dummy-row compute** (full-width decode), costing throughput in
direct proportion to `pad_width - real_load`: catastrophic at low concurrency
(n=8: 28.16 -> 6.05, ~4.6x slower, because 8 real streams pay for a 128-wide
step).
3. **In continuous serving padding can't even hold the width constant**: arrivals
are perpetually mid-prefill, so the idle-slot count varies and reuse DROPS
72% -> 38% (the opposite of the goal). It only stabilises the pure-decode
*tail* of a burst (verified: width pinned at 64 as real decoders fell 49->5),
which is exactly where the dummy compute is most wasteful.
4. **The completion-driven batch shrink that padding prevents is itself a
throughput WIN** in a compute-bound regime (fewer real streams -> cheaper
steps -> survivors finish faster); forcing constant width forfeits it.
**Conclusion.** The residual burst gap (paged 4.53 vs vLLM 6.87 at n=128 ~= 66%)
is a **GPU-compute** gap (vLLM's MoE decode kernel + scheduler are ~1.3x faster on
aggregate), not a host-loop gap. A host-side graph-reuse lever cannot close it.
Do not re-pursue padded/fixed-slot shapes for throughput; if the host loop is ever
re-confirmed dominant on other hardware (re-run reason 1's baseline-vs-S1+S3 A/B
first), revisit - but only with an *adaptive* width matched to live load, never a
fixed pad-to-`--parallel`.
---