docs(paged): record P4 CBv2 NO-GO at the perf kill-gate

P4 (token-granular continuous-batching scheduler, LLAMA_CONTINUOUS_BATCH_V2, default-off) stopped honestly at the P0 perf kill-gate.

The kill-gate subset (per-seq chunked-prefill cursors + adaptive decode bucketing, server-side only, zero ggml/ files, ~68 LOC + a new unit-tested server-admission-policy.h) was implemented and correctness-proven green:

- canonical md5 both models, default-off AND cbv2-on: MoE 8cb0ce23, dense 5951a5b4;
- test-backend-ops: MUL_MAT 1146/1146, MUL_MAT_ID 806/806, GATED_DELTA_NET 46/46;
- cursor-interleave PROVEN via LLAMA_CBV2_TRACE with decode+prefill co-batched and per-seq cursors advancing across steps, dbucket==n_decode no-pad;
- determinism-NEUTRAL: CBv2 diverges from control no more than control diverges from itself, the paged concurrent-greedy path being inherently non-deterministic run-to-run in the baseline too.

The kill-gate GO criterion - a >20% TTFT-under-load drop with md5 green and serving-aggregate not regressed - was NOT demonstrated: the staggered/burst TTFT A/B was force-terminated by the harness mid-run (CONTROL-only, 30/60 raws), so the TTFT deltas are not-yet-measured placeholders, not measured neutrality.

Per the phased contract go=false was the kill-gate default: nothing built beyond P0 (no SLOT_STATE_PREEMPTED, no aging/starvation-freedom), nothing landed. This is the scope-anticipated outcome - P4 is a GB10 TTFT/fairness/enabler lever, not a throughput lever (decode is GPU-compute-bound), so a NO-GO on the TTFT gate is expected and any throughput payoff is non-GB10.

Records the honest rejection in EXECUTION_REARCH_SCOPE.md (P4 RESULT subsection) and PARITY_HANDOFF.md chronology, including the re-score path: read the finalized DGX ~/bench/p4_cbv2/perf_20260702_194359/RESULTS.md once the CANDIDATE arm completes; a genuine >20% staggered-TTFT drop clearing max(2%, 3*stdev) re-scores go=true and triggers the full P4 build-out.

Fork localai-paged untouched at 653bb2f3d; LocalAI series stays at 46 patches; topic branch p4-cbv2 retained on the DGX fork at ebb649335 (base 653bb2f3d, not pushed).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-07-02 20:37:03 -04:00 · 2026-07-02 18:03:34 +00:00
parent 586639d016
commit 865e77c4ec
2 changed files with 177 additions and 0 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/EXECUTION_REARCH_SCOPE.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/EXECUTION_REARCH_SCOPE.md
@@ -555,6 +555,102 @@ records P2-at-this-granularity as a confirmed floor.
 - **Upstream-clash / rebase-safety:** safest area. `tools/server/server-context.cpp` is
  a fork-owned tool, not ggml core; upstream churns it less and conflicts are mechanical.

+#### P4 RESULT (NO-GO at the P0 perf kill-gate, recorded 2026-07-02, `LLAMA_CONTINUOUS_BATCH_V2`, default-off)
+
+The CBv2 P0 kill-gate subset (per-seq chunked-prefill cursors + adaptive decode
+bucketing) was **implemented and correctness-proven green**, but the P0 kill-gate's
+stated GO criterion - a **> 20% TTFT-under-load drop** with md5 green and
+serving-aggregate not regressed - was **NOT demonstrated**, so per the phased
+contract `go=false` was the kill-gate default, **nothing was built beyond P0**
+(no `SLOT_STATE_PREEMPTED`, no aging/starvation-freedom), and **nothing landed.**
+The topic branch `p4-cbv2` is retained on the DGX fork at
+`ebb649335fe7686524a3630ee2fdffce44be6d52` (base `localai-paged` `653bb2f3d`, NOT
+pushed); the fork `localai-paged` HEAD is **untouched at `653bb2f3d`** and the
+LocalAI series stays at 46 patches (`0001-0055`). **This is the scope-anticipated
+outcome:** the P4 section frames CBv2 on GB10 as a TTFT + fairness + architecture-
+enabler lever, **not** a throughput lever (decode is GPU-compute-bound; the
+host-loop-dead measurement is real), so a NO-GO on the TTFT perf gate is the
+expected result and any throughput payoff lives on non-GB10 silicon (out of scope).
+
+- **WHY THE PERF GATE DID NOT FIRE GO (honest caveat, not a measured neutrality).**
+  The staggered/burst TTFT A/B was **force-terminated by the harness mid-run** with
+  only the CONTROL arm complete (30/60 raws; CANDIDATE arm never started).
+  Consequently `ttft_n32_stag`, `ttft_n128_stag`, `ttft_n8`, `ttft_burst`, and
+  `agg_delta_pct_worst` are **NOT-YET-MEASURED `0.0` placeholders**, not measured
+  neutrality. No affirmative `> 20%` staggered-TTFT drop was demonstrated, so the
+  kill-gate default (`go=false`) stands.
+- **CORRECTNESS GATES ALL GREEN (DGX GB10, arch sm_121a), the substantive P0
+  result.** Behind `LLAMA_CONTINUOUS_BATCH_V2=1` (default OFF, byte-identical off):
+  - **(a) canonical md5 GREEN both models, default-off AND cbv2-on:** paged-MoE
+    `8cb0ce23777bf55f92f63d0292c756b0`, dense `5951a5b4d624ce891e22ab5fca9bc439`.
+  - **(c) `test-backend-ops` GREEN (zero-ggml side-effect proof):** MUL_MAT
+    1146/1146, MUL_MAT_ID 806/806, GATED_DELTA_NET 46/46.
+  - **(c) CURSOR INTERLEAVE PROVEN** (`LLAMA_CBV2_TRACE`, staggered N=20): steps
+    carry decode AND prefill tokens in the SAME batch with per-slot cursors
+    advancing across steps, not slot-exclusive. Verbatim step=6: `n_decode_toks=5
+    n_prefill_toks=1535 n_seqs=20` with 15 partial cursors; slot s112 advances
+    144/523 -> 281 -> 418 -> 519 over steps 6-9 while decode runs; adaptive
+    fair-share cap tracks live load (410@5waiting, 171@12, 137@15, 291@7, 508@4);
+    `dbucket==n_decode` confirms **no fixed pad-to-parallel** (per
+    `DECODE_SERVING_SCOPE.md` net-negative-on-GB10).
+  - **(b) SERVER DETERMINISM = CBv2 is NEUTRAL / correctness-preserving.** The
+    literal exact-reproducibility gate is unsatisfiable by ANY scheduler here: the
+    paged CONCURRENT greedy path is inherently non-deterministic run-to-run in the
+    BASELINE too (the control default scheduler diverges from itself), a pre-existing
+    benign near-tied-argmax / co-batch FP-reduction-order property
+    (`PAGED_BITEXACT_NOTE`), on both dense and MoE. The discriminating test - does
+    CBv2 diverge from control MORE than control diverges from itself - **PASSES**:
+    across 8 configs {dense,moe} x {degenerate,natural} x {gen8,gen64}, per-request
+    cross-arm divergence stays within +/-1 to +/-3 requests (out of 32) of the
+    within-arm run-to-run baseline (small-count noise; e.g. MoE-natural gen64 base 31/32,
+    worst-cross 31/32; dense-degenerate base 14, cross 12-17). Single-sequence greedy is fully
+    deterministic (the md5 gate above).
+- **Implementation (kill-gate subset only; correct, committed on `p4-cbv2`, NOT
+  pushed; server-side only, ZERO `ggml/` files, ~68 LOC in `server-context.cpp` +
+  a new unit-tested header).** (1) Per-seq chunked-prefill cursors with a
+  **load-adaptive fair-share cap** = `ceil(prefill_leftover / n_waiting)` floored at
+  `LLAMA_CBV2_CHUNK_MIN` (default 128, deliberately NOT `n_ubatch` so a 512-token
+  prompt actually chunks under load); CBv2 activates the shipped 0016 decode-first
+  budget by default (`T=n_batch`, no `LLAMA_MAX_BATCH_TOKENS` needed) and replaces
+  0016's fixed cap with this fair-share cap; cursor = `slot.prompt.n_tokens()`
+  advancing across steps. (2) Adaptive decode bucket policy (`LLAMA_CBV2_DECODE_PAD`
+  default 0 => `bucket==n_decode`, no padding; policy computed+traced only, never
+  fed to batch formation, so bit-exact-safe; row-emission for host-bound silicon is
+  the deferred [Build phase]). Pure math lives in the NEW unit-tested header
+  `tools/server/server-admission-policy.h` (namespace `cbv2`) +
+  `server-admission-policy-test.cpp` (host-side unit tests ALL PASS local + DGX);
+  `server-context.cpp` is the thin integration; step trace under `LLAMA_CBV2_TRACE=1`.
+- **Honest delta vs expectation.** Kill-gate GO required TTFT-under-load to drop
+  `> 20%`; **delivered: not demonstrated** (perf A/B force-terminated control-only).
+  The correctness substrate (bit-exact md5, proven decode+prefill co-batching with
+  per-seq cursors, determinism-neutrality) is real and is the enabler the scope
+  values, but the perf axis that gates the phase was never measured to GO.
+- **WHAT WOULD CHANGE THE VERDICT (re-score path).** Read the finalized DGX
+  `~/bench/p4_cbv2/perf_20260702_194359/RESULTS.md` once the CANDIDATE arm completes
+  (the perf driver `p4_agg.py` auto-writes medians+stdev deltas with the
+  `> 20%`-TTFT-drop GO logic baked in). **IF** it shows a genuine `> 20%`
+  staggered-TTFT drop clearing `max(2%, 3*stdev)` with md5 green and aggregate not
+  regressed, re-score `go=true` and trigger the **full P4 build-out**:
+  `SLOT_STATE_PREEMPTED` + release-KV-keep-prompt-tokens re-admit (reusing the paged
+  burst-reclaim patch 0024 + `paged-alloc.cpp` defrag), aging/starvation-freedom with
+  a constructed starvation test, preemption-transition + aging unit tests, and a
+  forced-preemption byte-identical-resume determinism gate. **ELSE** (the
+  scope-expected case) this NO-GO stands and P4 is deferred as a GB10 TTFT/fairness/
+  enabler lever whose throughput payoff is non-GB10.
+- **Series-numbering flag (for whoever lands a future GO).** The P0 code comments
+  label `[paged 0056]` per the pinned fork's next slot (46 patches), but the LocalAI
+  worktree README is already ahead at `0056-0061` (the MoE MMQ trace series) -
+  reconcile the actual series number on landing (likely `0062`).
+- **Artifacts (DGX `~/bench/p4_cbv2/`):** `build_20260702_192141/` (build.log);
+  `gates_20260702_192632/` (SUMMARY.txt: md5 x4, test-backend-ops, cbv2_trace.txt,
+  determinism tsvs); `det2_20260702_193123/` + `det3_20260702_193649/` +
+  `det4_20260702_194040/` (determinism diff-matrix: degenerate / natural / gen8);
+  `perf_20260702_194359/` (raw_*.json + auto-written RESULTS.md). Environment:
+  `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1`, `LLAMA_MAX_BATCH_TOKENS` unset,
+  sm_121a, GPU lock held. Code on `p4-cbv2` `ebb649335`:
+  `tools/server/server-admission-policy.h`, `server-admission-policy-test.cpp`,
+  `server-context.cpp` (+68).
+
 ### P5: FLA-faithful GDN prefill scan (blocked solve_tril port; the algorithm never actually tested in-backend)

 - **Goal:** replace the hand f32 chunked scan (`gdn_core`, 95.7 us/tok, 2.62x vLLM) with
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -2474,3 +2474,84 @@ pushed). Artifacts on the DGX: `~/bench/p2_moe_region/focused_20260702_172644/`
 (sentinels 5x, correctness OFF+ON, md5, S_PP@512 5x, KL) + `RESULTS.txt`,
 `.../killgate_20260702_171826/` (engagement proof, 0x on both models),
 `.../build_20260702_145928/` (build logs).
+
+## P4 token-granular continuous-batching scheduler (CBv2) - NO-GO at the perf kill-gate (recorded 2026-07-02)
+
+Fourth phase of the `EXECUTION_REARCH_SCOPE.md` additive program. The P0 kill-gate
+subset for `LLAMA_CONTINUOUS_BATCH_V2` (default-off) was **implemented and
+correctness-proven green**, but the kill-gate's stated GO criterion - a **> 20%
+TTFT-under-load drop** with md5 green and serving-aggregate not regressed - was
+**NOT demonstrated**, so per the phased contract `go=false` was the kill-gate
+default, nothing was built beyond P0, and nothing landed. See the "P4 RESULT"
+subsection in `EXECUTION_REARCH_SCOPE.md` for the full record; summary and
+provenance:
+
+- **Verdict: NO-GO / DO-NOT-SHIP at the perf gate (scope-anticipated).** The P4
+  section frames CBv2 on GB10 as a TTFT + fairness + architecture-enabler lever,
+  **not** a throughput lever (decode is GPU-compute-bound; the host-loop-dead
+  measurement is real), so a NO-GO on the TTFT perf gate is the expected outcome and
+  any throughput payoff is non-GB10 (out of scope).
+- **Why the perf gate did not fire GO (honest caveat, not measured neutrality).**
+  The staggered/burst TTFT A/B was **force-terminated by the harness mid-run** with
+  only the CONTROL arm complete (30/60 raws; CANDIDATE never started), so the four
+  `ttft_*_delta_pct` and `agg_delta_pct_worst` fields are **NOT-YET-MEASURED `0.0`
+  placeholders**, not measured neutrality. No affirmative `> 20%` staggered-TTFT drop
+  was demonstrated; the kill-gate default (`go=false`) stands.
+- **Correctness gates all GREEN (DGX GB10, sm_121a), the substantive P0 result.**
+  Behind `LLAMA_CONTINUOUS_BATCH_V2=1` (default OFF, byte-identical off):
+  - **(a) canonical md5 GREEN both models, default-off AND cbv2-on:** paged-MoE
+    `8cb0ce23`, dense `5951a5b4`.
+  - **(c) `test-backend-ops` GREEN (zero-ggml side-effect proof):** MUL_MAT
+    1146/1146, MUL_MAT_ID 806/806, GATED_DELTA_NET 46/46.
+  - **(c) cursor-interleave PROVEN** (`LLAMA_CBV2_TRACE`, staggered N=20): steps
+    co-batch decode AND prefill tokens with per-slot cursors advancing across steps
+    (step=6: `n_decode_toks=5 n_prefill_toks=1535 n_seqs=20`, 15 partial cursors;
+    slot s112 144/523 -> 281 -> 418 -> 519 over steps 6-9 while decode runs); adaptive
+    fair-share cap tracks live load (410@5w, 171@12, 137@15); `dbucket==n_decode` =>
+    no fixed pad-to-parallel.
+  - **(b) determinism = CBv2 NEUTRAL / correctness-preserving.** The paged concurrent
+    greedy path is inherently non-deterministic run-to-run in the BASELINE too (a
+    benign near-tied-argmax / co-batch FP-reduction-order property,
+    `PAGED_BITEXACT_NOTE`), so the literal exact-match gate is unsatisfiable by any
+    scheduler (control fails it too). The discriminating test - does CBv2 diverge
+    from control more than control diverges from itself - PASSES across 8 configs
+    {dense,moe} x {degenerate,natural} x {gen8,gen64}: cross-arm divergence stays within
+    +/-1 to +/-3 requests (out of 32) of the within-arm baseline. Single-sequence greedy is fully
+    deterministic (the md5 gate).
+- **What would change the verdict (re-score path).** Read the finalized DGX
+  `~/bench/p4_cbv2/perf_20260702_194359/RESULTS.md` once the CANDIDATE arm completes
+  (`p4_agg.py` auto-writes medians+stdev with the `> 20%`-drop GO logic baked in). If
+  it shows a genuine `> 20%` staggered-TTFT drop clearing `max(2%, 3*stdev)` with md5
+  green and aggregate not regressed, re-score `go=true` and trigger the full P4
+  build-out: `SLOT_STATE_PREEMPTED` + release-KV-keep-prompt re-admit (paged
+  burst-reclaim 0024 + `paged-alloc.cpp` defrag), aging/starvation-freedom + a
+  constructed starvation test, preemption/aging unit tests, and a forced-preemption
+  byte-identical-resume determinism gate. Else this NO-GO stands.
+
+Implementation (kill-gate subset only; correct, committed, NOT pushed; server-side
+only, ZERO `ggml/` files, ~68 LOC): `tools/server/server-context.cpp` thin
+integration + a NEW pure unit-tested header `tools/server/server-admission-policy.h`
+(namespace `cbv2`) + `server-admission-policy-test.cpp`. (1) Per-seq chunked-prefill
+cursors with a load-adaptive fair-share cap `ceil(prefill_leftover/n_waiting)`
+floored at `LLAMA_CBV2_CHUNK_MIN` (default 128, NOT `n_ubatch`, so a 512 prompt
+actually chunks under load); CBv2 activates the shipped 0016 decode-first budget by
+default (`T=n_batch`) and replaces 0016's fixed cap; cursor =
+`slot.prompt.n_tokens()`. (2) Adaptive decode bucket policy (`LLAMA_CBV2_DECODE_PAD`
+default 0 => `bucket==n_decode`, no padding per `DECODE_SERVING_SCOPE.md`
+net-negative; policy computed+traced only, never fed to batch formation =>
+bit-exact-safe; row-emission is the deferred [Build phase]). Trace under
+`LLAMA_CBV2_TRACE=1`.
+
+Series-numbering flag: P0 comments label `[paged 0056]` per the fork's next slot,
+but the LocalAI worktree README is already ahead at `0056-0061` (MoE MMQ trace
+series) - reconcile on landing (likely `0062`).
+
+Fork `localai-paged` HEAD **untouched at `653bb2f3d`**; LocalAI series stays at 46
+patches (`0001-0055`). Topic branch `mudler/llama.cpp:p4-cbv2` retained at
+`ebb649335fe7686524a3630ee2fdffce44be6d52` (base `653bb2f3d`, NOT pushed). Artifacts
+on the DGX `~/bench/p4_cbv2/`: `build_20260702_192141/` (build.log),
+`gates_20260702_192632/` (SUMMARY.txt: md5 x4, test-backend-ops, cbv2_trace.txt,
+determinism tsvs), `det2_20260702_193123/` + `det3_20260702_193649/` +
+`det4_20260702_194040/` (determinism diff-matrix), `perf_20260702_194359/`
+(raw_*.json + auto-written RESULTS.md). Environment: `LLAMA_KV_PAGED=1
+LLAMA_MOE_FORCE_GRAPHS=1`, `LLAMA_MAX_BATCH_TOKENS` unset, sm_121a, GPU lock held.
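
The two pieces of arithmetic this record leans on, the load-adaptive fair-share cap and the re-score rule, can be sketched in a few lines of Python. This is an illustration only: it is not the C++ code on `p4-cbv2` (`server-admission-policy.h`) and not the logic in `p4_agg.py`. The function names, the argument names, the handling of `n_waiting == 0`, the use of a boolean for "aggregate not regressed", and the reading of "clearing `max(2%, 3*stdev)`" as an additional noise-floor condition are all assumptions made for the sketch.

```python
from typing import Optional


def fair_share_chunk_cap(prefill_leftover: int, n_waiting: int,
                         chunk_min: int = 128) -> int:
    """Hypothetical per-sequence prefill chunk cap for one scheduler step.

    Follows the formula stated in the record:
    ceil(prefill_leftover / n_waiting), floored at chunk_min
    (the record's LLAMA_CBV2_CHUNK_MIN, default 128).
    """
    if n_waiting <= 0:
        # Assumption: with no waiting sequence there is nothing to share,
        # so the whole leftover budget is the cap.
        return max(chunk_min, prefill_leftover)
    share = -(-prefill_leftover // n_waiting)  # integer ceiling division
    return max(chunk_min, share)


def rescore_go(ttft_drop_pct: Optional[float],
               ttft_stdev_pct: Optional[float],
               md5_green: bool,
               aggregate_not_regressed: bool) -> bool:
    """Hypothetical reading of the re-score rule described in the record.

    None means "not measured". An unmeasured delta is never treated as a
    pass (or as neutrality); it simply leaves the default go=false in place.
    """
    if ttft_drop_pct is None or ttft_stdev_pct is None:
        return False
    # Assumed interpretation: the drop must exceed 20% AND clear the
    # noise floor max(2%, 3 * stdev).
    noise_floor = max(2.0, 3.0 * ttft_stdev_pct)
    return (md5_green
            and aggregate_not_regressed
            and ttft_drop_pct > 20.0
            and ttft_drop_pct > noise_floor)


if __name__ == "__main__":
    # Hypothetical leftover budget of 2048 tokens at varying load.
    for n_waiting in (4, 12, 32):
        print(n_waiting, fair_share_chunk_cap(2048, n_waiting))
    # The state recorded here: CANDIDATE arm unmeasured, so no GO.
    print(rescore_go(None, None, md5_green=True, aggregate_not_regressed=True))
```

With a hypothetical 2048-token leftover, 32 waiting sequences would each get a raw share of 64 tokens, which the floor lifts to 128; that is still below 512, consistent with the record's note that a 512-token prompt actually chunks under load. Representing unmeasured deltas as `None` rather than `0.0` is a design choice of the sketch: it keeps a placeholder from being mistaken for measured neutrality.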