docs(paged): finalize P4 CBv2 record with the measured A/B verdict

Replace the forced-report placeholders with the completed A/B (60/60 raws)
from dgx:~/bench/p4_cbv2/perf_20260702_194359/RESULTS.md: NO-GO confirmed by
measurement, and the result is a regression rather than flat. CBv2 fair-share
chunked prefill regresses TTFT under staggered load (N=32 p50 +33.6%, N=128 p50
+15.5%) and regresses aggregate and decode-aggregate throughput by 6.9%, beyond
noise, at staggered N=128. Analysis recorded: processor-sharing delays
near-uniform prompt completion by construction; the scheduler-shaped-TTFT
premise is partially refuted for GB10 (patch 0016 already captures the
schedulable win); TTFT parity routes through P3/P5 prefill compute.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ettore Di Giacinto
2026-07-02 18:09:55 +00:00
parent 865e77c4ec
commit 7b129a51f1
2 changed files with 42 additions and 13 deletions

@@ -572,13 +572,33 @@ enabler lever, **not** a throughput lever (decode is GPU-compute-bound; the
host-loop-dead measurement is real), so a NO-GO on the TTFT perf gate is the
expected result and any throughput payoff lives on non-GB10 silicon (out of scope).
- **WHY THE PERF GATE DID NOT FIRE GO (honest caveat, not a measured neutrality).**
The staggered/burst TTFT A/B was **force-terminated by the harness mid-run** with
only the CONTROL arm complete (30/60 raws; CANDIDATE arm never started).
Consequently `ttft_n32_stag`, `ttft_n128_stag`, `ttft_n8`, `ttft_burst`, and
`agg_delta_pct_worst` are **NOT-YET-MEASURED `0.0` placeholders**, not measured
neutrality. No affirmative `> 20%` staggered-TTFT drop was demonstrated, so the
kill-gate default (`go=false`) stands.
- **FINAL MEASURED VERDICT (the A/B completed autonomously after the forced report;
full 60/60 raws, 5 reps per arm per shape;
`dgx:~/bench/p4_cbv2/perf_20260702_194359/RESULTS.md`): NO-GO CONFIRMED BY
MEASUREMENT, and stronger than flat: CBv2-at-this-granularity REGRESSES.**
TTFT-GO shapes: NONE. Measured deltas (candidate vs control medians; "clears" =
beyond max(2%, 3 sigma)):
- staggered N=32: TTFT p50 **+33.6% WORSE** (4559.3 -> 6091.3 ms, clears), mean
+31.4% worse (clears), p95 +14.3% worse (clears); agg/decode -3.3/-3.4%
(inside a very noisy ~21% gate).
- staggered N=128: TTFT p50 +15.5% / mean +17.9% / p95 +12.1% worse (all clear);
**aggregate -6.9% and decode-agg -6.9% REGRESSED beyond noise** (0.4% sd).
- burst N=128: TTFT p50 +13.5% / mean +10.5% worse (clear); agg -3.9% (clears).
- staggered N=8 and burst N=8: neutral. burst N=32: decode-agg +36.3% (barely
clears a 35.2% noise gate; high-variance shape; the one positive signal:
fair-share keeps decodes flowing through a prefill wave).
- **WHY (analysis, recorded so it is not re-litigated):** fair-share chunked
prefill is processor-sharing; for a near-uniform prompt population it delays
every prompt's prefill completion versus run-to-completion admission
(round-robin maximizes mean completion time for identical jobs), so TTFT rises
by construction, and at N=128 the extra interleave overhead also costs
throughput. The premise that the TTFT scaling curve was "scheduler-shaped" is
hereby PARTIALLY REFUTED for GB10: the shipped decode-first budget (patch 0016)
already captures the schedulable win, and vLLM's TTFT advantage on this hardware
is dominated by its 2.6-2.8x prefill compute (buckets 1-2), not batch formation.
TTFT parity therefore routes through P3/P5 (prefill compute), not the scheduler.
Chunked-prefill fair-share may still pay on mixed long/short-prompt workloads
and on non-GB10 (host-bound) silicon; both are out of scope here.
- **CORRECTNESS GATES ALL GREEN (DGX GB10, arch sm_121a), the substantive P0
result.** Behind `LLAMA_CONTINUOUS_BATCH_V2=1` (default OFF, byte-identical off):
- **(a) canonical md5 GREEN both models, default-off AND cbv2-on:** paged-MoE

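The "clears" criterion used in the hunk above, a delta beyond max(2%, 3 sigma), can be sketched as a small check. This is an illustrative sketch and not the harness code: the function names are hypothetical, sigma is assumed to be expressed as a percentage of the control median, and the 7.0 sigma used for staggered N=32 is an assumed value chosen only to reproduce the quoted ~21% gate.

```python
def pct_delta(control: float, candidate: float) -> float:
    """Relative change of candidate vs control, in percent."""
    return (candidate - control) / control * 100.0


def noise_gate_pct(sigma_pct: float, floor_pct: float = 2.0, k: float = 3.0) -> float:
    """Gate width in percent: max(2%, 3 sigma), per the definition in the record."""
    return max(floor_pct, k * sigma_pct)


def clears(delta_pct: float, sigma_pct: float) -> bool:
    """True when the delta, in either direction, is larger than the noise gate."""
    return abs(delta_pct) > noise_gate_pct(sigma_pct)


# Staggered N=32 TTFT p50 medians quoted in the record (ms).
ttft_delta = pct_delta(4559.3, 6091.3)
print(f"staggered N=32 TTFT p50 delta: {ttft_delta:+.1f}%")

# Staggered N=128 aggregate: -6.9% with 0.4% sd, so the 2% floor applies.
print("N=128 aggregate clears:", clears(-6.9, 0.4))

# Staggered N=32 aggregate: -3.3% against a ~21% gate (sigma of 7.0 is assumed).
print("N=32 aggregate clears:", clears(-3.3, 7.0))
```

Taking the absolute value means the same gate flags both regressions and improvements, which is why the burst N=32 decode-agg gain of +36.3% only barely clears its 35.2% gate.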
@@ -2491,12 +2491,21 @@ provenance:
**not** a throughput lever (decode is GPU-compute-bound; the host-loop-dead
measurement is real), so a NO-GO on the TTFT perf gate is the expected outcome and
any throughput payoff is non-GB10 (out of scope).
- **Why the perf gate did not fire GO (honest caveat, not measured neutrality).**
The staggered/burst TTFT A/B was **force-terminated by the harness mid-run** with
only the CONTROL arm complete (30/60 raws; CANDIDATE never started), so the four
`ttft_*_delta_pct` and `agg_delta_pct_worst` fields are **NOT-YET-MEASURED `0.0`
placeholders**, not measured neutrality. No affirmative `> 20%` staggered-TTFT drop
was demonstrated; the kill-gate default (`go=false`) stands.
- **FINAL MEASURED VERDICT (A/B completed autonomously after the forced report;
60/60 raws, 5 reps/arm/shape; `dgx:~/bench/p4_cbv2/perf_20260702_194359/RESULTS.md`):
NO-GO CONFIRMED, and stronger than flat: CBv2-at-this-granularity REGRESSES.**
TTFT-GO shapes: NONE. staggered N=32 TTFT p50 **+33.6% WORSE** (4559 -> 6091 ms,
clears noise), mean +31.4% worse; staggered N=128 TTFT p50 +15.5% / mean +17.9%
worse AND **aggregate/decode-agg -6.9% regressed beyond noise**; burst N=128 TTFT
+10-13% worse, agg -3.9%; N=8 shapes neutral; the one positive was burst N=32
decode-agg +36.3% on a very noisy shape. ANALYSIS (do not re-litigate): fair-share
chunked prefill is processor-sharing and delays every near-uniform prompt's prefill
completion versus run-to-completion admission, so TTFT rises by construction; the
"TTFT scaling is scheduler-shaped" premise is PARTIALLY REFUTED for GB10 - patch
0016's decode-first budget already captures the schedulable win, and vLLM's TTFT
advantage here is dominated by its 2.6-2.8x prefill compute. TTFT parity routes
through P3/P5 (prefill compute), not the scheduler. Fair-share may still pay on
mixed long/short-prompt workloads and non-GB10 (host-bound) silicon; out of scope.
- **Correctness gates all GREEN (DGX GB10, sm_121a), the substantive P0 result.**
Behind `LLAMA_CONTINUOUS_BATCH_V2=1` (default OFF, byte-identical off):
- **(a) canonical md5 GREEN both models, default-off AND cbv2-on:** paged-MoE
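The processor-sharing argument in the analysis above can be illustrated with a toy single-server model. This is a minimal sketch under stated assumptions, not the scheduler in the patch: all prompts are identical and arrive together, prefill cost is counted in abstract work units, decode and interleave overhead are ignored, and the function names are hypothetical.

```python
from collections import deque


def run_to_completion(n_jobs: int, work: int) -> list[int]:
    """Completion time of each job when prefills run one after another."""
    return [work * (i + 1) for i in range(n_jobs)]


def fair_share(n_jobs: int, work: int, chunk: int) -> list[int]:
    """Completion time of each job under round-robin chunked service."""
    queue = deque((job, work) for job in range(n_jobs))
    done = [0] * n_jobs
    clock = 0
    while queue:
        job, left = queue.popleft()
        step = min(chunk, left)
        clock += step
        left -= step
        if left == 0:
            done[job] = clock          # prefill finished: first token can be emitted
        else:
            queue.append((job, left))  # go to the back of the line for the next chunk
    return done


def mean(values: list[int]) -> float:
    return sum(values) / len(values)


rtc = run_to_completion(n_jobs=4, work=8)
fs = fair_share(n_jobs=4, work=8, chunk=2)
print("run-to-completion:", rtc, "mean", mean(rtc))
print("fair-share:       ", fs, "mean", mean(fs))
```

In this toy model both policies finish the last prefill at the same time, but fair-share pushes every earlier completion toward the end of the wave, so the mean (and median) completion time rises. With mixed long and short prompts the ordering can reverse, which matches the caveat that fair-share may still pay on such workloads.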