docs(paged): finalize P4 CBv2 record with the measured A/B verdict

The forced-report placeholders are replaced with the completed 60/60-raw A/B from dgx:~/bench/p4_cbv2/perf_20260702_194359/RESULTS.md: NO-GO confirmed by measurement, and stronger than flat. CBv2 fair-share chunked prefill regresses TTFT under staggered load (N=32 p50 +33.6%, N=128 p50 +15.5%) and regresses aggregate/decode -6.9% beyond noise at staggered N=128. Analysis recorded: processor-sharing delays near-uniform prompt completion by construction; the scheduler-shaped-TTFT premise is partially refuted for GB10 (patch 0016 already captures the schedulable win); TTFT parity routes through P3/P5 prefill compute. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-07-02 20:37:03 -04:00 · 2026-07-02 18:09:55 +00:00
parent 865e77c4ec
commit 7b129a51f1
2 changed files with 42 additions and 13 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/EXECUTION_REARCH_SCOPE.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/EXECUTION_REARCH_SCOPE.md
@@ -572,13 +572,33 @@ enabler lever, **not** a throughput lever (decode is GPU-compute-bound; the
 host-loop-dead measurement is real), so a NO-GO on the TTFT perf gate is the
 expected result and any throughput payoff lives on non-GB10 silicon (out of scope).

- **WHY THE PERF GATE DID NOT FIRE GO (honest caveat, not a measured neutrality).**
-  The staggered/burst TTFT A/B was **force-terminated by the harness mid-run** with
-  only the CONTROL arm complete (30/60 raws; CANDIDATE arm never started).
-  Consequently `ttft_n32_stag`, `ttft_n128_stag`, `ttft_n8`, `ttft_burst`, and
-  `agg_delta_pct_worst` are **NOT-YET-MEASURED `0.0` placeholders**, not measured
-  neutrality. No affirmative `> 20%` staggered-TTFT drop was demonstrated, so the
-  kill-gate default (`go=false`) stands.
+- **FINAL MEASURED VERDICT (the A/B completed autonomously after the forced report;
+  full 60/60 raws, 5 reps per arm per shape;
+  `dgx:~/bench/p4_cbv2/perf_20260702_194359/RESULTS.md`): NO-GO CONFIRMED BY
+  MEASUREMENT, and stronger than flat: CBv2-at-this-granularity REGRESSES.**
+  TTFT-GO shapes: NONE. Measured deltas (candidate vs control medians; "clears" =
+  beyond max(2%, 3 sigma)):
+  - staggered N=32: TTFT p50 **+33.6% WORSE** (4559.3 -> 6091.3 ms, clears), mean
+    +31.4% worse (clears), p95 +14.3% worse (clears); agg/decode -3.3/-3.4%
+    (inside a very noisy ~21% gate).
+  - staggered N=128: TTFT p50 +15.5% / mean +17.9% / p95 +12.1% worse (all clear);
+    **aggregate -6.9% and decode-agg -6.9% REGRESSED beyond noise** (0.4% sd).
+  - burst N=128: TTFT p50 +13.5% / mean +10.5% worse (clear); agg -3.9% (clears).
+  - staggered N=8 and burst N=8: neutral. burst N=32: decode-agg +36.3% (barely
+    clears a 35.2% noise gate; high-variance shape; the one positive signal:
+    fair-share keeps decodes flowing through a prefill wave).
+- **WHY (analysis, recorded so it is not re-litigated):** fair-share chunked
+  prefill is processor-sharing; for a near-uniform prompt population it delays
+  every prompt's prefill completion versus run-to-completion admission
+  (round-robin maximizes mean completion time for identical jobs), so TTFT rises
+  by construction, and at N=128 the extra interleave overhead also costs
+  throughput. The premise that the TTFT scaling curve was "scheduler-shaped" is
+  hereby PARTIALLY REFUTED for GB10: the shipped decode-first budget (patch 0016)
+  already captures the schedulable win, and vLLM's TTFT advantage on this hardware
+  is dominated by its 2.6-2.8x prefill compute (buckets 1-2), not batch formation.
+  TTFT parity therefore routes through P3/P5 (prefill compute), not the scheduler.
+  Chunked-prefill fair-share may still pay on mixed long/short-prompt workloads
+  and on non-GB10 (host-bound) silicon; both are out of scope here.
 - **CORRECTNESS GATES ALL GREEN (DGX GB10, arch sm_121a), the substantive P0
  result.** Behind `LLAMA_CONTINUOUS_BATCH_V2=1` (default OFF, byte-identical off):
  - **(a) canonical md5 GREEN both models, default-off AND cbv2-on:** paged-MoE
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -2491,12 +2491,21 @@ provenance:
  **not** a throughput lever (decode is GPU-compute-bound; the host-loop-dead
  measurement is real), so a NO-GO on the TTFT perf gate is the expected outcome and
  any throughput payoff is non-GB10 (out of scope).
- **Why the perf gate did not fire GO (honest caveat, not measured neutrality).**
-  The staggered/burst TTFT A/B was **force-terminated by the harness mid-run** with
-  only the CONTROL arm complete (30/60 raws; CANDIDATE never started), so the four
-  `ttft_*_delta_pct` and `agg_delta_pct_worst` fields are **NOT-YET-MEASURED `0.0`
-  placeholders**, not measured neutrality. No affirmative `> 20%` staggered-TTFT drop
-  was demonstrated; the kill-gate default (`go=false`) stands.
+- **FINAL MEASURED VERDICT (A/B completed autonomously after the forced report;
+  60/60 raws, 5 reps/arm/shape; `dgx:~/bench/p4_cbv2/perf_20260702_194359/RESULTS.md`):
+  NO-GO CONFIRMED, and stronger than flat: CBv2-at-this-granularity REGRESSES.**
+  TTFT-GO shapes: NONE. staggered N=32 TTFT p50 **+33.6% WORSE** (4559 -> 6091 ms,
+  clears noise), mean +31.4% worse; staggered N=128 TTFT p50 +15.5% / mean +17.9%
+  worse AND **aggregate/decode-agg -6.9% regressed beyond noise**; burst N=128 TTFT
+  +10-13% worse, agg -3.9%; N=8 shapes neutral; the one positive was burst N=32
+  decode-agg +36.3% on a very noisy shape. ANALYSIS (do not re-litigate): fair-share
+  chunked prefill is processor-sharing and delays every near-uniform prompt's prefill
+  completion versus run-to-completion admission, so TTFT rises by construction; the
+  "TTFT scaling is scheduler-shaped" premise is PARTIALLY REFUTED for GB10 - patch
+  0016's decode-first budget already captures the schedulable win, and vLLM's TTFT
+  advantage here is dominated by its 2.6-2.8x prefill compute. TTFT parity routes
+  through P3/P5 (prefill compute), not the scheduler. Fair-share may still pay on
+  mixed long/short-prompt workloads and non-GB10 (host-bound) silicon; out of scope.
 - **Correctness gates all GREEN (DGX GB10, sm_121a), the substantive P0 result.**
  Behind `LLAMA_CONTINUOUS_BATCH_V2=1` (default OFF, byte-identical off):
  - **(a) canonical md5 GREEN both models, default-off AND cbv2-on:** paged-MoE