docs(paged): record datacenter Blackwell readiness phase

Assisted-by: Codex:gpt-5
2026-07-03 12:57:02 -04:00 · 2026-07-01 14:28:41 +00:00
parent 2efb0ec362
commit eb82ff138f
3 changed files with 112 additions and 6 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md
@@ -12,10 +12,10 @@ with artifact path, gates, benchmark rows, and decision.
 - Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
 - Current tested source: DGX mirror
  `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
- Latest attempt: Phase72.
- Latest decision: keep `LLAMA_TTFT_PREFILL_FIRST=1`
-  `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` opt-in only. It regressed broad
-  serving aggregate, decode, TTFT, and wall time at `n=8`, `n=32`, and `n=128`.
+- Latest attempt: Phase73.
+- Latest decision: no new GB10 benchmark or source patch. The next parity
+  evidence requires a datacenter Blackwell rerun, or a standalone GDN
+  blocked-solve PoC before any backend GDN source work.

 ## Current Serving Record

@@ -55,6 +55,39 @@ Decision:

 ## Attempt Log

+### Phase73: Datacenter Blackwell Rerun Readiness
+
+- Date: 2026-07-01.
+- Plan:
+  `docs/superpowers/plans/2026-07-01-datacenter-blackwell-rerun-readiness-phase73.md`.
+- Artifact: no new benchmark artifact.
+- Source baseline: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
+- Result type: harness/spec audit only.
+
+Evidence:
+
+- Phase72 is the current GB10 serving baseline. Default llama decode/vLLM
+  ratios remain `0.7561`, `0.7158`, and `0.6935` at `n=8/32/128`.
+- Grouped-MMQ/W4A16: Phase61 direct activation was the last structurally
+  distinct W4A16 shortcut; it failed its keep gate and stayed far behind
+  default FP4-MMQ. Phase66 quantize plus gather was only `5.10%`, below the
+  source-funding threshold.
+- GDN: Phase71 kept shipped M5 as default. The remaining GDN gap is a larger
+  FLA/CuteDSL-class C=64 blocked-solve/register-state implementation, not
+  another C32/QS/global-Ai/local reorder.
+- Harness: `paged-current-serving-snapshot.sh` already records
+  `hardware_class=datacenter_blackwell` for B200/B100/GB200, supports
+  `DRY_RUN=1`, `SERVED_MODEL_NAME`, and vLLM deployment overrides.
+
+Decision:
+
+- Do not start more GB10 grouped-MMQ/W4A16 source work.
+- Do not start GDN backend source work until a standalone C=64 blocked-solve
+  PoC records timing, numerical error, and resource estimates.
+- The next parity run should be on datacenter Blackwell hardware with the
+  existing same-session serving harness plus graph-node decode profiles.
+- No parity claim is made by this phase.
+
 ### Phase72: TTFT Min32 Broader Serving

 - Date: 2026-07-01.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -1182,3 +1182,64 @@ Decision: keep `LLAMA_TTFT_PREFILL_FIRST=1` plus
 `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` opt-in only. It regressed aggregate,
 decode, TTFT, and wall time at every tested concurrency in the broader shape,
 and widened the vLLM decode gap. Do not default this scheduler policy on GB10.
+
+## 18. PHASE73 RESULT: DATACENTER BLACKWELL RERUN READINESS
+
+Phase73 is a no-new-benchmark decision/spec phase. Plan:
+`docs/superpowers/plans/2026-07-01-datacenter-blackwell-rerun-readiness-phase73.md`.
+Benchmark ledger:
+`backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md`.
+
+No GPU benchmark was run and no llama.cpp source was changed. Source baseline
+remains DGX mirror commit `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
+
+Decision:
+
+- Do not start more GB10 grouped-MMQ/W4A16 source work. Phase61 direct-A was
+  the last structurally distinct W4A16 shortcut and failed its keep gate; Phase66
+  quantize plus gather was only `5.10%`, below the source-funding threshold.
+- Do not start GDN backend source work until a standalone C=64 blocked-solve PoC
+  proves timing and numerical viability. Phase71 kept M5 as shipped; the
+  remaining GDN gap is a larger FLA/CuteDSL-class blocked-solve/register-state
+  implementation, not another C32/QS/global-Ai/local reorder.
+- The next parity evidence should come from datacenter Blackwell hardware with
+  the existing same-session serving harness plus graph-node decode profiles.
+
+B200 rerun checklist:
+
+1. Build and verify the llama.cpp paged binary on B200 or equivalent
+   datacenter Blackwell hardware with the correct CUDA architecture/settings.
+2. Install and verify vLLM `0.23.0+` with the intended Blackwell backend stack.
+3. Confirm both model forms exist: `q36-35b-a3b-nvfp4.gguf` and
+   `q36-35b-a3b-nvfp4-vllm`.
+4. Run `paged-current-serving-snapshot.sh` with `NPL="8 32 128"`, `PTOK=128`,
+   `GEN=64`, `PARALLEL=128`, `CTX=131072`, and B200-specific
+   `VLLM_GPU_MEMORY_UTILIZATION`, `VLLM_MAX_NUM_SEQS`, and
+   `VLLM_TENSOR_PARALLEL_SIZE`.
+5. Before interpreting the artifact, require `hardware.txt` to say
+   `hardware_class=datacenter_blackwell`, `gate_summary.tsv` to be green,
+   pre/post MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT` and
+   `MUL_MAT_ID` op gates green, and `summary.tsv` rows for both paged and vLLM.
+6. Run decode/profile reruns with `nsys --cuda-graph-trace=node` and inspect
+   whether vLLM is using native FP4/CUTLASS/FlashInfer rather than the GB10
+   Marlin fallback.
+
+Standalone GDN source-work gate:
+
+```sh
+nvcc -O3 -arch=sm_121a \
+  ~/scratch_tc_gdn_poc/gdn_blocked_solve_bench.cu \
+  -o ~/scratch_tc_gdn_poc/gdn_blocked_solve_bench
+
+~/scratch_tc_gdn_poc/gdn_blocked_solve_bench \
+  --c 64 --dk 128 --dv 128 \
+  --iters 1000 \
+  --precision tf32,offdiag3x,apply3x \
+  --oracle f64 \
+  --dump-json ~/bench/phase73_gdn_blocked_solve_poc.json
+```
+
+Do not touch `ggml/src/ggml-cuda/gated_delta_net.cu` for this larger path until
+that standalone artifact shows a material timing win, non-catastrophic weak and
+mixed decay error, plausible register/shared-memory fit, and records timing,
+precision-rung error, and condition-number distribution.
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -1586,6 +1586,16 @@ green, but min32 regressed every tested concurrency: aggregate ratios
 ratios `1.0379`/`1.0977`/`1.0300` at `n=8/32/128`. Keep min32 opt-in only and
 do not default it on GB10.

+Phase73 made the post-Phase72 next-step decision. It ran no new benchmark and
+changed no llama.cpp source. Grouped-MMQ/W4A16 GB10 source work is closed:
+Phase61 direct-A was the last structurally distinct W4A16 shortcut and failed
+its keep gate, and Phase66 quantize plus gather was only `5.10%`. GDN backend
+source work is also gated: Phase71 kept M5 as shipped, and the remaining GDN
+gap is a FLA/CuteDSL-class C=64 blocked-solve/register-state implementation,
+not another local reorder. The next parity evidence should be a datacenter
+Blackwell same-session rerun, or a standalone GDN blocked-solve PoC before any
+backend GDN source work.
+
 ### Phase 60 current W4A16 prefill profile

 Phase60 re-profiled the current W4A16 grouped MoE prefill path after the
@@ -1861,8 +1871,10 @@ revalidated it against sequential-disabled and serial-chunked baselines, and
 Phase10/11/13 rejected the smaller follow-up GDN reorders. Phase41/43 closed
 D1 on the current GB10 path unless a fresh route trace proves a host-sync
 fallback returned. Phase60/61/66 rejected another small W4A16/direct-A or
-quant/gather pass. Treat the list below as pre-Phase60 planning context, not an
-active queue.
+quant/gather pass. Phase72 rejected min32 as a broad serving default, and
+Phase73 set the active queue to datacenter-Blackwell rerun readiness or a
+standalone GDN blocked-solve PoC before source work. Treat the list below as
+pre-Phase60 planning context, not an active queue.

 Ranked, each with its pass-gate: