LocalAI/backend/cpp
Ettore Di Giacinto 5a38dd3f09 docs(paged): adversarial review of the continuous-batch scheduler scope
Append a source-verified Review / risk section to
CONTINUOUS_BATCH_SCHEDULER_SCOPE.md. Verdict: scope is sound, GO on P0 ->
P1, conditional P2, separate-track P3.

Key checks against HEAD 151343b:
- Tractability: zero libllama changes. The mixed per-seq prefill+decode
  ubatch is the existing shipping path (common_batch_add with per-token
  pos/seq, init_batch split; paged_alloc is a set of hooks on the same
  llama_kv_cache class, not a new class). The new scheduler changes only
  the prefill token count, never the batch structure.
- The real serving config is kv_unified=false (-> n_stream=n_seq_max=128),
  so the split path is split_equal(sequential=true), not the contiguous
  split_simple the pseudocode implies. Fold this into the P0 ubatch-shape
  and determinism analysis; lock the split path in the A/B.
- CUDA graphs ruled out: both NVFP4 H2H vLLM servers ran --enforce-eager
  (cudagraph_mode=NONE), so the npl128 2.4x decode gap is genuine
  eager-kernel + per-step host overhead. Scheduler cannot close it; the
  157/333 ceiling stands.
- TTFT root cause quantified: prefill_tps collapses with concurrency for llama
  (dense 1117->125) while vLLM holds flat ~1420. The dynamic T-D budget
  attacks this directly and can sustain prefill_tps >= vLLM during the
  drain, so burst-TTFT parity is mechanically plausible, but it couples to
  a decode-ITL knob (T) that MUST be co-reported with TTFT.
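
A minimal sketch of the T-D budget arithmetic, under assumptions not stated
in this message: T is read as a per-step token budget, D as the number of
sequences decoding one token each in that step, and pending prompts are
served FIFO. Every name below (prefill_budget, plan_prefill, PendingPrefill,
PrefillGrant) is hypothetical and is not llama.cpp or LocalAI API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical types for illustration only.
struct PendingPrefill {
    int32_t seq_id;       // sequence still waiting on prompt tokens
    int32_t n_remaining;  // prompt tokens not yet submitted
};

struct PrefillGrant {
    int32_t seq_id;
    int32_t n_tokens;     // prompt tokens granted for this step
};

// Assumption: each of the D decoding sequences consumes one token of the
// per-step budget T; whatever is left goes to prefill. Clamped at zero.
inline int32_t prefill_budget(int32_t T, int32_t D) {
    assert(T >= 0 && D >= 0);
    return std::max<int32_t>(0, T - D);
}

// Hand out prompt chunks in queue order until the budget is exhausted.
// FIFO is an assumed policy, chosen only to keep the sketch short.
inline std::vector<PrefillGrant> plan_prefill(
        int32_t T, int32_t D, const std::vector<PendingPrefill>& queue) {
    std::vector<PrefillGrant> grants;
    int32_t budget = prefill_budget(T, D);
    for (const auto& p : queue) {
        if (budget == 0) {
            break;
        }
        const int32_t n = std::min(budget, p.n_remaining);
        if (n > 0) {
            grants.push_back({p.seq_id, n});
            budget -= n;
        }
    }
    return grants;
}
```

Under these assumptions, raising T leaves more tokens for prefill in each
step but also makes each step larger for the sequences that are decoding,
which is why drain-phase decode-ITL has to be reported next to TTFT.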

Two calibration fixes are required before P1: co-report drain-phase
decode-ITL with TTFT (and stop citing the steady-state decode_agg number,
whether as a cost or as a selling point), and acknowledge the
split_equal/n_stream=128 path. Neither changes the go
decision. P1 is the minimal high-ROI step (a handful of line edits at named
seams); gate P2 on P1 metrics; P3 (kernel/CUDA-graph) owns the 2.4x
residual independent of the scheduler.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-23 22:48:31 +00:00