docs(paged): adversarial review of the continuous-batch scheduler scope

Append a source-verified Review / risk section to CONTINUOUS_BATCH_SCHEDULER_SCOPE.md. Verdict: scope is sound, GO on P0 -> P1, conditional P2, separate-track P3. Key checks against HEAD 151343b: - Tractability: zero libllama changes. The mixed per-seq prefill+decode ubatch is the existing shipping path (common_batch_add per-token pos/seq, init_batch split, paged_alloc is hooks on the same llama_kv_cache class, not a new class). The new scheduler changes only the prefill token count, never the batch structure. - The real serving config is kv_unified=false (-> n_stream=n_seq_max=128), so the split path is split_equal(sequential=true), not the contiguous split_simple the pseudocode implies. Fold into P0 ubatch-shape and determinism analysis; lock the split path in the A/B. - CUDA graphs ruled out: both NVFP4 H2H vLLM servers ran --enforce-eager (cudagraph_mode=NONE), so the npl128 2.4x decode gap is genuine eager-kernel + per-step host overhead. Scheduler cannot close it; the 157/333 ceiling stands. - TTFT root quantified: prefill_tps collapses with concurrency for llama (dense 1117->125) while vLLM holds flat ~1420. The dynamic T-D budget attacks this directly and can sustain prefill_tps >= vLLM during the drain, so burst-TTFT parity is mechanically plausible, but it couples to a decode-ITL knob (T) that MUST be co-reported with TTFT. Two calibration fixes required before P1: co-report drain-phase decode-ITL with TTFT (stop charging/selling the steady-state decode_agg number), and acknowledge the split_equal/n_stream=128 path. Neither changes the go decision. P1 is the minimal high-ROI step (handful of line edits at named seams); gate P2 on P1 metrics; P3 (kernel/CUDA-graph) owns the 2.4x residual independent of the scheduler. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 01:47:18 -04:00 · 2026-06-23 22:48:31 +00:00
parent ed17fc804e
commit 5a38dd3f09
1 changed files with 124 additions and 0 deletions
--- a/backend/cpp/llama-cpp/patches/paged/CONTINUOUS_BATCH_SCHEDULER_SCOPE.md
+++ b/backend/cpp/llama-cpp/patches/paged/CONTINUOUS_BATCH_SCHEDULER_SCOPE.md
@@ -373,3 +373,127 @@ scheduler.patch` (next free slot after 0015) and the LocalAI option lands in `gr
 beside `max_prefill_tokens`. Commit with `git commit -s`, trailer
 `Assisted-by: Claude:opus-4.8 [Claude Code]`, no `Co-Authored-By`, no em-dashes. Do not push
 (human pushes).
+
+---
+
+## Review / risk (adversarial, source-verified)
+
+Skeptical staff review against the actual source at HEAD `151343b` (server-context.cpp,
+llama-batch.cpp, llama-kv-cache.cpp, paged-*.cpp), grpc-server.cpp in this worktree, and the
+committed `QWEN36_NVFP4_BENCH.md` plus the vLLM H2H serve logs/scripts on the box.
+
+### Verdict: the scope is SOUND. GO on P0 -> P1, CONDITIONAL P2, separate-track P3.
+
+The central de-risking claims check out against the code, and the load-bearing honesty (decode
+residual is a kernel ceiling, not a scheduler defect) is correct and now further corroborated.
+Two calibration fixes are required before P1 (below), neither changes the go decision.
+
+### (1) Tractability - CONFIRMED bounded; zero libllama changes. What enables/blocks it, concretely:
+
+- **Enables (already-exercised path, not new surface).** A mixed prefill+decode ubatch with
+  per-seq different `n_past` is the *existing* behaviour. `llama_batch` carries per-token `pos`
+  and `seq_id` (`common_batch_add(batch, tok, pos_next(), {slot.id}, ...)`); `llama_kv_cache` +
+  `paged_alloc::place()` place each `(seq, pos)` independently; `llama_kv_cache::init_batch`
+  (line 742) already splits the mixed batch into ubatches. **The server emits exactly this mixed
+  decode+prefill batch today** - patch 0013 ships it and produces coherent output - so the new
+  scheduler changes only the *count* of prefill tokens, never the batch *structure*. There is no
+  `llama_decode`/ubatch/KV rewrite in scope.
+- **Blocks: nothing in libllama.** The only constraints are pre-existing and orthogonal to the
+  target workload: (i) `can_batch_with` (same task type + equal LoRA per batch); (ii)
+  `split_equal(sequential=true)` errors on *coupled* sequences (shared-prompt parallel sampling),
+  forcing `-kvu`. Neither is introduced by this change.
+- **Correction to fold in:** the scope's [C] and the pseudocode imply contiguous `split_simple`
+  chunking. The real serving/benchmark config (`--parallel 128`, `kv_unified` default = `false`
+  -> `n_stream = n_seq_max = 128`) takes the **`split_equal(n_ubatch, sequential=true)`** path
+  (llama-kv-cache.cpp:742), which balances per-sequence rather than slicing contiguously. This
+  does not break anything (0013 already hits it) but it means the actual scheduled object is a
+  split_equal ubatch set; P0 must characterize that ubatch shape (not assume contiguous 512-chunks)
+  and the determinism band is over split_equal groupings. Lock the split path (unified vs not) in
+  the A/B so the byte-identical-to-0013 gate is meaningful. grpc seam [E] verified at
+  grpc-server.cpp:761-786 (`kv_paged`, `max_prefill_tokens`/`mpt`); new `mbt`/`prefill_cap` knobs
+  hang off it identically.
+
+### (2) Does it close the gap - the 2.4x is NOT CUDA graphs, and the TTFT root is quantified.
+
+- **CUDA graphs ruled out (verified).** Both NVFP4 H2H vLLM servers ran `--enforce-eager`
+  (`h2h_dense_vllm.sh`, `h2h_moe_serve_vllm.sh`; engine logs show `enforce_eager=True`,
+  `cudagraph_mode=NONE`, `CompilationMode.NONE`). So the npl128 2.4x decode gap is a genuine
+  **eager-mode kernel + per-step host-overhead** gap (ggml graph rebuild/realloc + ~1k kernel
+  launches per step on the weak Grace cores, paged-KV gather, MoE expert gather). The scheduler
+  cannot touch it; the staggered all-128-decoding 157.4 tok/s ceiling is solid. Scope is right to
+  refuse the 391/811 number. (CUDA graphs are a future *both-sides* lever, not the current cause.)
+- **The TTFT gap has a measured root the scope under-uses: prefill_tps collapse.** From the bench,
+  llama `prefill_tps` falls 1117 -> 752 -> 465 -> **125** (dense, npl 8/32/64/128) while vLLM holds
+  **flat ~1420** (MoE: 2813 -> 657 vs vLLM flat ~4263). That collapse - not a separate "scheduling
+  quality" abstraction - is the direct cause of the 491 s / 85 s TTFT, and it is exactly what the
+  dynamic `T - D` budget attacks: when decode load `D` is low (early in a burst) the leftover
+  `T - D` lets prefill take ~`n_batch` per step, and because llama's *larger per-step chunk*
+  compensates for its ~2.4x slower steps, a `T = 2048` budget can sustain prefill_tps at or above
+  vLLM's ~1420 during the drain. **So burst-TTFT parity is mechanically plausible, not just
+  "toward"** - the static budget-256 throttles prefill to 256/step (hence its weak 305 s) where the
+  dynamic budget would not. This strengthens P1's case beyond what the doc claims.
+- **Mandatory calibration fix:** that TTFT win **couples to a decode-ITL knob**. Spending the full
+  `T - D` on prefill during the drain makes those steps full `T`-token (mixed) computes, so
+  co-batched decoders get 1 token per slow step (ITL spike) *during the drain* - precisely vLLM's
+  tradeoff, navigated by `T`. The 157/333 ceiling is the **post-drain steady state**, not the
+  drain phase. Therefore the scope must **co-report drain-phase decode-ITL alongside TTFT** and
+  treat `T` as the published trade knob; reporting TTFT alone would hide the cost and reporting
+  decode_agg alone would hide the win (it is averaged across drain + steady state, which is why it
+  "barely moves"). Soften "P1+P2 reach 25 s / 8 s": the defensible claim is *staggered/realistic
+  arrival ~2 s, and all-at-once burst approaching vLLM with a tunable decode-ITL cost*.
+
+### (3) Correctness - paged orthogonality confirmed at source; the real risks are config, not code.
+
+- **Paged-KV is the same `llama_kv_cache` class** with `paged_alloc::` hooks inside the existing
+  find_slot/placement (llama-kv-cache.cpp:1043-1083), driven by per-slot `(seq, pos)` - which this
+  change does not touch. `init_batch`/split is paged-agnostic. The scope's "orthogonal" claim is
+  verified, not asserted. Keep the hard `LLAMA_KV_PAGED=1` vs `=0` identical-decisions gate.
+- **Determinism**: the FA grouping nondeterminism is over **split_equal** ubatches in the real
+  config; the `T = n_batch` A/B-must-be-byte-identical-to-0013 gate is the right oracle and is
+  sound (default-off path is untouched).
+- **Low-concurrency regression**: gated to byte-identical when knobs unset; the only live vector is
+  a **mis-tuned `T`** spiking ITL at low npl (the scope already flags `T` defaults). Config hygiene,
+  not a code risk. Add a guard/floor so `T` cannot be set below `n_ubatch`.
+
+### (4) Smaller higher-ROI step - yes, and the scope already contains it (P1).
+
+The minimal high-ROI change is **P1 alone**: replace the static read (server-context.cpp:2737-2747)
+with `prefill_budget_step = max(floor, T - batch.n_tokens)` computed after the decode-fill at line
+2719, and bound the Phase-2 loops by `T` / that budget (3188, 3320, 3326). That is a handful of
+line edits at named seams, default-off, and it captures the self-tuning + the bulk of the TTFT win.
+The even-smaller validation spike: a one-line `n_prefill_budget = max(floor, T - batch.n_tokens)`
+to confirm the prefill_tps/TTFT mechanism before writing the full P1. **P2** (round-robin +
+`prefill_cap_per_slot` + checkpoint-aware admission) is genuinely higher-effort and lower-marginal
+(it buys TTFT *spread*/tail and burst robustness, not the median); **gate P2 on P1's measured
+burst-TTFT-spread and drain-ITL**, do not commit to it up front. There is no smaller step that also
+fixes the static budget's npl-dependence - tuning 0013's constant cannot (256 is net-negative at
+npl8 and costs MoE TTFT), so P1 is the floor.
+
+### Realistic effort / payoff and sequencing
+
+- **P0** ~0.5-1 wk (harness largely exists in `~/bench/`): add drain-phase decode-ITL to the metric
+  set, lock the split path, isolate checkpoints (`n_ctx_checkpoints=0`). Gate only.
+- **P1** ~2-4 days: small diff + the A/B-vs-0013 byte-identical gate + the npl/dense/MoE sweep.
+  Payoff: self-tuning hold of 161/333 with no hand-picked constant; burst-TTFT 3-10x better than
+  0013 (plausibly approaching vLLM on the burst, parity on staggered), at a published `T`-tunable
+  decode-ITL cost. **This is the high-ROI core and the clean supersession of 0013.**
+- **P2** ~1-2 wk, conditional: fairness/admission + checkpoint-cost-awareness + tuning. Payoff: TTFT
+  tail/spread + no admission collapse under sustained load. Worth it only if P1 metrics show a
+  residual spread/robustness problem.
+- **P3** separate track, high effort: the *only* path to 391/811 is the eager-kernel + per-step
+  host-overhead residual. Highest-value probe is a **CUDA-graph capture of the steady-state
+  pure-decode step** - but note this works *independent of the scheduler* (the all-128-decoding
+  step is already fixed-shape today); the scheduler neither blocks nor specially enables it, so do
+  not credit graphs to the scheduler. The scope's "uniform decode step is a precondition" is a mild
+  over-claim; correct it to "graphs apply to the pure-decode steady state, which the scheduler does
+  not change."
+
+### Bottom line
+
+GO. The work is correctly localized to `update_slots()` batch-formation policy, requires no
+libllama changes (the mixed per-seq batch is the existing, shipping path), and supersedes 0013
+cleanly. The honest ceiling is real and well-stated; the two fixes are (a) co-report drain-phase
+decode-ITL with TTFT and stop selling/charging the decode_agg number, and (b) acknowledge the
+`split_equal`/`n_stream=128` path in the determinism and ubatch-shape analysis. Sequence
+P0 -> P1, measure, then decide P2; keep P3 (kernel/CUDA-graph) on its own track as the sole owner
+of the 2.4x throughput residual.