From 7b129a51f1fa4ef04cc784d276b80eea514e6807 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Thu, 2 Jul 2026 18:09:55 +0000
Subject: [PATCH] docs(paged): finalize P4 CBv2 record with the measured A/B verdict

The forced-report placeholders are replaced with the completed 60/60-raw
A/B from dgx:~/bench/p4_cbv2/perf_20260702_194359/RESULTS.md: NO-GO
confirmed by measurement, and stronger than flat. CBv2 fair-share chunked
prefill regresses TTFT under staggered load (N=32 p50 +33.6%, N=128 p50
+15.5%) and regresses aggregate/decode -6.9% beyond noise at staggered
N=128. Analysis recorded: processor-sharing delays near-uniform prompt
completion by construction; the scheduler-shaped-TTFT premise is partially
refuted for GB10 (patch 0016 already captures the schedulable win); TTFT
parity routes through P3/P5 prefill compute.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto
---
 .../docs/EXECUTION_REARCH_SCOPE.md | 34 +++++++++++++++----
 .../docs/PARITY_HANDOFF.md         | 21 ++++++++----
 2 files changed, 42 insertions(+), 13 deletions(-)

diff --git a/backend/cpp/llama-cpp-localai-paged/docs/EXECUTION_REARCH_SCOPE.md b/backend/cpp/llama-cpp-localai-paged/docs/EXECUTION_REARCH_SCOPE.md
index e83bab1a6..993c68268 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/EXECUTION_REARCH_SCOPE.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/EXECUTION_REARCH_SCOPE.md
@@ -572,13 +572,33 @@
   enabler lever, **not** a throughput lever (decode is GPU-compute-bound; the
   host-loop-dead measurement is real), so a NO-GO on the TTFT perf gate is the
   expected result and any throughput payoff lives on non-GB10 silicon (out of scope).
-- **WHY THE PERF GATE DID NOT FIRE GO (honest caveat, not a measured neutrality).**
-  The staggered/burst TTFT A/B was **force-terminated by the harness mid-run** with
-  only the CONTROL arm complete (30/60 raws; CANDIDATE arm never started).
-  Consequently `ttft_n32_stag`, `ttft_n128_stag`, `ttft_n8`, `ttft_burst`, and
-  `agg_delta_pct_worst` are **NOT-YET-MEASURED `0.0` placeholders**, not measured
-  neutrality. No affirmative `> 20%` staggered-TTFT drop was demonstrated, so the
-  kill-gate default (`go=false`) stands.
+- **FINAL MEASURED VERDICT (the A/B completed autonomously after the forced report;
+  full 60/60 raws, 5 reps per arm per shape;
+  `dgx:~/bench/p4_cbv2/perf_20260702_194359/RESULTS.md`): NO-GO CONFIRMED BY
+  MEASUREMENT, and stronger than flat: CBv2-at-this-granularity REGRESSES.**
+  TTFT-GO shapes: NONE. Measured deltas (candidate vs control medians; "clears" =
+  beyond max(2%, 3 sigma)):
+  - staggered N=32: TTFT p50 **+33.6% WORSE** (4559.3 -> 6091.3 ms, clears), mean
+    +31.4% worse (clears), p95 +14.3% worse (clears); agg/decode -3.3/-3.4%
+    (inside a very noisy ~21% gate).
+  - staggered N=128: TTFT p50 +15.5% / mean +17.9% / p95 +12.1% worse (all clear);
+    **aggregate -6.9% and decode-agg -6.9% REGRESSED beyond noise** (0.4% sd).
+  - burst N=128: TTFT p50 +13.5% / mean +10.5% worse (clear); agg -3.9% (clears).
+  - staggered N=8 and burst N=8: neutral. burst N=32: decode-agg +36.3% (barely
+    clears a 35.2% noise gate; high-variance shape; the one positive signal:
+    fair-share keeps decodes flowing through a prefill wave).
+- **WHY (analysis, recorded so it is not re-litigated):** fair-share chunked
+  prefill is processor-sharing; for a near-uniform prompt population it delays
+  every prompt's prefill completion versus run-to-completion admission
+  (round-robin maximizes mean completion time for identical jobs), so TTFT rises
+  by construction, and at N=128 the extra interleave overhead also costs
+  throughput. The premise that the TTFT scaling curve was "scheduler-shaped" is
+  hereby PARTIALLY REFUTED for GB10: the shipped decode-first budget (patch 0016)
+  already captures the schedulable win, and vLLM's TTFT advantage on this hardware
+  is dominated by its 2.6-2.8x prefill compute (buckets 1-2), not batch formation.
+  TTFT parity therefore routes through P3/P5 (prefill compute), not the scheduler.
+  Chunked-prefill fair-share may still pay on mixed long/short-prompt workloads
+  and on non-GB10 (host-bound) silicon; both are out of scope here.
 - **CORRECTNESS GATES ALL GREEN (DGX GB10, arch sm_121a), the substantive P0 result.**
   Behind `LLAMA_CONTINUOUS_BATCH_V2=1` (default OFF, byte-identical off):
   - **(a) canonical md5 GREEN both models, default-off AND cbv2-on:** paged-MoE
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index c8ddfa14d..4abaf8583 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -2491,12 +2491,21 @@ provenance:
   **not** a throughput lever (decode is GPU-compute-bound; the host-loop-dead
   measurement is real), so a NO-GO on the TTFT perf gate is the expected outcome and
   any throughput payoff is non-GB10 (out of scope).
-- **Why the perf gate did not fire GO (honest caveat, not measured neutrality).**
-  The staggered/burst TTFT A/B was **force-terminated by the harness mid-run** with
-  only the CONTROL arm complete (30/60 raws; CANDIDATE never started), so the four
-  `ttft_*_delta_pct` and `agg_delta_pct_worst` fields are **NOT-YET-MEASURED `0.0`
-  placeholders**, not measured neutrality. No affirmative `> 20%` staggered-TTFT drop
-  was demonstrated; the kill-gate default (`go=false`) stands.
+- **FINAL MEASURED VERDICT (A/B completed autonomously after the forced report;
+  60/60 raws, 5 reps/arm/shape; `dgx:~/bench/p4_cbv2/perf_20260702_194359/RESULTS.md`):
+  NO-GO CONFIRMED, and stronger than flat: CBv2-at-this-granularity REGRESSES.**
+  TTFT-GO shapes: NONE. staggered N=32 TTFT p50 **+33.6% WORSE** (4559 -> 6091 ms,
+  clears noise), mean +31.4% worse; staggered N=128 TTFT p50 +15.5% / mean +17.9%
+  worse AND **aggregate/decode-agg -6.9% regressed beyond noise**; burst N=128 TTFT
+  +10-13% worse, agg -3.9%; N=8 shapes neutral; the one positive was burst N=32
+  decode-agg +36.3% on a very noisy shape. ANALYSIS (do not re-litigate): fair-share
+  chunked prefill is processor-sharing and delays every near-uniform prompt's prefill
+  completion versus run-to-completion admission, so TTFT rises by construction; the
+  "TTFT scaling is scheduler-shaped" premise is PARTIALLY REFUTED for GB10 - patch
+  0016's decode-first budget already captures the schedulable win, and vLLM's TTFT
+  advantage here is dominated by its 2.6-2.8x prefill compute. TTFT parity routes
+  through P3/P5 (prefill compute), not the scheduler. Fair-share may still pay on
+  mixed long/short-prompt workloads and non-GB10 (host-bound) silicon; out of scope.
 - **Correctness gates all GREEN (DGX GB10, sm_121a), the substantive P0 result.**
   Behind `LLAMA_CONTINUOUS_BATCH_V2=1` (default OFF, byte-identical off):
   - **(a) canonical md5 GREEN both models, default-off AND cbv2-on:** paged-MoE