From eb82ff138ffbe9dee19272092949566e9d7496c7 Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Wed, 1 Jul 2026 14:28:41 +0000 Subject: [PATCH] docs(paged): record datacenter Blackwell readiness phase Assisted-by: Codex:gpt-5 --- .../llama-cpp-localai-paged/docs/BENCHMARK.md | 41 +++++++++++-- .../docs/PARITY_HANDOFF.md | 61 +++++++++++++++++++ .../docs/VLLM_PARITY_LEVER_MAP.md | 16 ++++- 3 files changed, 112 insertions(+), 6 deletions(-) diff --git a/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md b/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md index a5ef7c9a3..4f9fab52b 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md @@ -12,10 +12,10 @@ with artifact path, gates, benchmark rows, and decision. - Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`. - Current tested source: DGX mirror `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`. -- Latest attempt: Phase72. -- Latest decision: keep `LLAMA_TTFT_PREFILL_FIRST=1` - `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` opt-in only. It regressed broad - serving aggregate, decode, TTFT, and wall time at `n=8`, `n=32`, and `n=128`. +- Latest attempt: Phase73. +- Latest decision: no new GB10 benchmark or source patch. The next parity + evidence requires a datacenter Blackwell rerun, or a standalone GDN + blocked-solve PoC before any backend GDN source work. ## Current Serving Record @@ -55,6 +55,39 @@ Decision: ## Attempt Log +### Phase73: Datacenter Blackwell Rerun Readiness + +- Date: 2026-07-01. +- Plan: + `docs/superpowers/plans/2026-07-01-datacenter-blackwell-rerun-readiness-phase73.md`. +- Artifact: no new benchmark artifact. +- Source baseline: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`. +- Result type: harness/spec audit only. + +Evidence: + +- Phase72 is the current GB10 serving baseline. Default llama decode/vLLM + ratios remain `0.7561`, `0.7158`, and `0.6935` at `n=8/32/128`. +- Grouped-MMQ/W4A16: Phase61 direct activation was the last structurally + distinct W4A16 shortcut; it failed its keep gate and stayed far behind + default FP4-MMQ. Phase66 quantize plus gather was only `5.10%`, below the + source-funding threshold. +- GDN: Phase71 kept shipped M5 as default. The remaining GDN gap is a larger + FLA/CuteDSL-class C=64 blocked-solve/register-state implementation, not + another C32/QS/global-Ai/local reorder. +- Harness: `paged-current-serving-snapshot.sh` already records + `hardware_class=datacenter_blackwell` for B200/B100/GB200, supports + `DRY_RUN=1`, `SERVED_MODEL_NAME`, and vLLM deployment overrides. + +Decision: + +- Do not start more GB10 grouped-MMQ/W4A16 source work. +- Do not start GDN backend source work until a standalone C=64 blocked-solve + PoC records timing, numerical error, and resource estimates. +- The next parity run should be on datacenter Blackwell hardware with the + existing same-session serving harness plus graph-node decode profiles. +- No parity claim is made by this phase. + ### Phase72: TTFT Min32 Broader Serving - Date: 2026-07-01. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md index 6e0319895..bdcea4521 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md @@ -1182,3 +1182,64 @@ Decision: keep `LLAMA_TTFT_PREFILL_FIRST=1` plus `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` opt-in only. It regressed aggregate, decode, TTFT, and wall time at every tested concurrency in the broader shape, and widened the vLLM decode gap. Do not default this scheduler policy on GB10. + +## 18. PHASE73 RESULT: DATACENTER BLACKWELL RERUN READINESS + +Phase73 is a no-new-benchmark decision/spec phase. Plan: +`docs/superpowers/plans/2026-07-01-datacenter-blackwell-rerun-readiness-phase73.md`. +Benchmark ledger: +`backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md`. + +No GPU benchmark was run and no llama.cpp source was changed. Source baseline +remains DGX mirror commit `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`. + +Decision: + +- Do not start more GB10 grouped-MMQ/W4A16 source work. Phase61 direct-A was + the last structurally distinct W4A16 shortcut and failed its keep gate; Phase66 + quantize plus gather was only `5.10%`, below the source-funding threshold. +- Do not start GDN backend source work until a standalone C=64 blocked-solve PoC + proves timing and numerical viability. Phase71 kept M5 as shipped; the + remaining GDN gap is a larger FLA/CuteDSL-class blocked-solve/register-state + implementation, not another C32/QS/global-Ai/local reorder. +- The next parity evidence should come from datacenter Blackwell hardware with + the existing same-session serving harness plus graph-node decode profiles. + +B200 rerun checklist: + +1. Build and verify the llama.cpp paged binary on B200 or equivalent + datacenter Blackwell hardware with the correct CUDA architecture/settings. +2. Install and verify vLLM `0.23.0+` with the intended Blackwell backend stack. +3. Confirm both model forms exist: `q36-35b-a3b-nvfp4.gguf` and + `q36-35b-a3b-nvfp4-vllm`. +4. Run `paged-current-serving-snapshot.sh` with `NPL="8 32 128"`, `PTOK=128`, + `GEN=64`, `PARALLEL=128`, `CTX=131072`, and B200-specific + `VLLM_GPU_MEMORY_UTILIZATION`, `VLLM_MAX_NUM_SEQS`, and + `VLLM_TENSOR_PARALLEL_SIZE`. +5. Before interpreting the artifact, require `hardware.txt` to say + `hardware_class=datacenter_blackwell`, `gate_summary.tsv` to be green, + pre/post MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT` and + `MUL_MAT_ID` op gates green, and `summary.tsv` rows for both paged and vLLM. +6. Run decode/profile reruns with `nsys --cuda-graph-trace=node` and inspect + whether vLLM is using native FP4/CUTLASS/FlashInfer rather than the GB10 + Marlin fallback. + +Standalone GDN source-work gate: + +```sh +nvcc -O3 -arch=sm_121a \ + ~/scratch_tc_gdn_poc/gdn_blocked_solve_bench.cu \ + -o ~/scratch_tc_gdn_poc/gdn_blocked_solve_bench + +~/scratch_tc_gdn_poc/gdn_blocked_solve_bench \ + --c 64 --dk 128 --dv 128 \ + --iters 1000 \ + --precision tf32,offdiag3x,apply3x \ + --oracle f64 \ + --dump-json ~/bench/phase73_gdn_blocked_solve_poc.json +``` + +Do not touch `ggml/src/ggml-cuda/gated_delta_net.cu` for this larger path until +that standalone artifact shows a material timing win, non-catastrophic weak and +mixed decay error, plausible register/shared-memory fit, and records timing, +precision-rung error, and condition-number distribution. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md index f81811eae..31cbd9d5f 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md @@ -1586,6 +1586,16 @@ green, but min32 regressed every tested concurrency: aggregate ratios ratios `1.0379`/`1.0977`/`1.0300` at `n=8/32/128`. Keep min32 opt-in only and do not default it on GB10. +Phase73 made the post-Phase72 next-step decision. It ran no new benchmark and +changed no llama.cpp source. Grouped-MMQ/W4A16 GB10 source work is closed: +Phase61 direct-A was the last structurally distinct W4A16 shortcut and failed +its keep gate, and Phase66 quantize plus gather was only `5.10%`. GDN backend +source work is also gated: Phase71 kept M5 as shipped, and the remaining GDN +gap is a FLA/CuteDSL-class C=64 blocked-solve/register-state implementation, +not another local reorder. The next parity evidence should be a datacenter +Blackwell same-session rerun, or a standalone GDN blocked-solve PoC before any +backend GDN source work. + ### Phase 60 current W4A16 prefill profile Phase60 re-profiled the current W4A16 grouped MoE prefill path after the @@ -1861,8 +1871,10 @@ revalidated it against sequential-disabled and serial-chunked baselines, and Phase10/11/13 rejected the smaller follow-up GDN reorders. Phase41/43 closed D1 on the current GB10 path unless a fresh route trace proves a host-sync fallback returned. Phase60/61/66 rejected another small W4A16/direct-A or -quant/gather pass. Treat the list below as pre-Phase60 planning context, not an -active queue. +quant/gather pass. Phase72 rejected min32 as a broad serving default, and +Phase73 set the active queue to datacenter-Blackwell rerun readiness or a +standalone GDN blocked-solve PoC before source work. Treat the list below as +pre-Phase60 planning context, not an active queue. Ranked, each with its pass-gate: