docs(paged): record datacenter Blackwell readiness phase

Assisted-by: Codex:gpt-5
This commit is contained in:
Ettore Di Giacinto
2026-07-01 14:28:41 +00:00
parent 2efb0ec362
commit eb82ff138f
3 changed files with 112 additions and 6 deletions

View File

@@ -12,10 +12,10 @@ with artifact path, gates, benchmark rows, and decision.
- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
- Current tested source: DGX mirror
`14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
- Latest attempt: Phase72.
- Latest decision: keep `LLAMA_TTFT_PREFILL_FIRST=1`
`LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` opt-in only. It regressed broad
serving aggregate, decode, TTFT, and wall time at `n=8`, `n=32`, and `n=128`.
- Latest attempt: Phase73.
- Latest decision: no new GB10 benchmark or source patch. The next parity
evidence requires a datacenter Blackwell rerun, or a standalone GDN
blocked-solve PoC before any backend GDN source work.
## Current Serving Record
@@ -55,6 +55,39 @@ Decision:
## Attempt Log
### Phase73: Datacenter Blackwell Rerun Readiness
- Date: 2026-07-01.
- Plan:
`docs/superpowers/plans/2026-07-01-datacenter-blackwell-rerun-readiness-phase73.md`.
- Artifact: no new benchmark artifact.
- Source baseline: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
- Result type: harness/spec audit only.
Evidence:
- Phase72 is the current GB10 serving baseline. Default llama decode/vLLM
ratios remain `0.7561`, `0.7158`, and `0.6935` at `n=8/32/128`.
- Grouped-MMQ/W4A16: Phase61 direct activation was the last structurally
distinct W4A16 shortcut; it failed its keep gate and stayed far behind
default FP4-MMQ. Phase66 quantize plus gather was only `5.10%`, below the
source-funding threshold.
- GDN: Phase71 kept shipped M5 as default. The remaining GDN gap is a larger
FLA/CuteDSL-class C=64 blocked-solve/register-state implementation, not
another C32/QS/global-Ai/local reorder.
- Harness: `paged-current-serving-snapshot.sh` already records
`hardware_class=datacenter_blackwell` for B200/B100/GB200, supports
`DRY_RUN=1`, `SERVED_MODEL_NAME`, and vLLM deployment overrides.
Decision:
- Do not start more GB10 grouped-MMQ/W4A16 source work.
- Do not start GDN backend source work until a standalone C=64 blocked-solve
PoC records timing, numerical error, and resource estimates.
- The next parity run should be on datacenter Blackwell hardware with the
existing same-session serving harness plus graph-node decode profiles.
- No parity claim is made by this phase.
### Phase72: TTFT Min32 Broader Serving
- Date: 2026-07-01.

View File

@@ -1182,3 +1182,64 @@ Decision: keep `LLAMA_TTFT_PREFILL_FIRST=1` plus
`LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` opt-in only. It regressed aggregate,
decode, TTFT, and wall time at every tested concurrency in the broader shape,
and widened the vLLM decode gap. Do not default this scheduler policy on GB10.
## 18. PHASE73 RESULT: DATACENTER BLACKWELL RERUN READINESS
Phase73 is a no-new-benchmark decision/spec phase. Plan:
`docs/superpowers/plans/2026-07-01-datacenter-blackwell-rerun-readiness-phase73.md`.
Benchmark ledger:
`backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md`.
No GPU benchmark was run and no llama.cpp source was changed. Source baseline
remains DGX mirror commit `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
Decision:
- Do not start more GB10 grouped-MMQ/W4A16 source work. Phase61 direct-A was
the last structurally distinct W4A16 shortcut and failed its keep gate; Phase66
quantize plus gather was only `5.10%`, below the source-funding threshold.
- Do not start GDN backend source work until a standalone C=64 blocked-solve PoC
proves timing and numerical viability. Phase71 kept M5 as shipped; the
remaining GDN gap is a larger FLA/CuteDSL-class blocked-solve/register-state
implementation, not another C32/QS/global-Ai/local reorder.
- The next parity evidence should come from datacenter Blackwell hardware with
the existing same-session serving harness plus graph-node decode profiles.
B200 rerun checklist:
1. Build and verify the llama.cpp paged binary on B200 or equivalent
datacenter Blackwell hardware with the correct CUDA architecture/settings.
2. Install and verify vLLM `0.23.0+` with the intended Blackwell backend stack.
3. Confirm both model forms exist: `q36-35b-a3b-nvfp4.gguf` and
`q36-35b-a3b-nvfp4-vllm`.
4. Run `paged-current-serving-snapshot.sh` with `NPL="8 32 128"`, `PTOK=128`,
`GEN=64`, `PARALLEL=128`, `CTX=131072`, and B200-specific
`VLLM_GPU_MEMORY_UTILIZATION`, `VLLM_MAX_NUM_SEQS`, and
`VLLM_TENSOR_PARALLEL_SIZE`.
5. Before interpreting the artifact, require `hardware.txt` to say
`hardware_class=datacenter_blackwell`, `gate_summary.tsv` to be green,
pre/post MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT` and
`MUL_MAT_ID` op gates green, and `summary.tsv` rows for both paged and vLLM.
6. Run decode/profile reruns with `nsys --cuda-graph-trace=node` and inspect
whether vLLM is using native FP4/CUTLASS/FlashInfer rather than the GB10
Marlin fallback.
Standalone GDN source-work gate:
```sh
nvcc -O3 -arch=sm_121a \
~/scratch_tc_gdn_poc/gdn_blocked_solve_bench.cu \
-o ~/scratch_tc_gdn_poc/gdn_blocked_solve_bench
~/scratch_tc_gdn_poc/gdn_blocked_solve_bench \
--c 64 --dk 128 --dv 128 \
--iters 1000 \
--precision tf32,offdiag3x,apply3x \
--oracle f64 \
--dump-json ~/bench/phase73_gdn_blocked_solve_poc.json
```
Do not touch `ggml/src/ggml-cuda/gated_delta_net.cu` for this larger path until
that standalone artifact shows a material timing win, non-catastrophic weak and
mixed decay error, plausible register/shared-memory fit, and records timing,
precision-rung error, and condition-number distribution.

View File

@@ -1586,6 +1586,16 @@ green, but min32 regressed every tested concurrency: aggregate ratios
ratios `1.0379`/`1.0977`/`1.0300` at `n=8/32/128`. Keep min32 opt-in only and
do not default it on GB10.
Phase73 made the post-Phase72 next-step decision. It ran no new benchmark and
changed no llama.cpp source. Grouped-MMQ/W4A16 GB10 source work is closed:
Phase61 direct-A was the last structurally distinct W4A16 shortcut and failed
its keep gate, and Phase66 quantize plus gather was only `5.10%`. GDN backend
source work is also gated: Phase71 kept M5 as shipped, and the remaining GDN
gap is a FLA/CuteDSL-class C=64 blocked-solve/register-state implementation,
not another local reorder. The next parity evidence should be a datacenter
Blackwell same-session rerun, or a standalone GDN blocked-solve PoC before any
backend GDN source work.
### Phase 60 current W4A16 prefill profile
Phase60 re-profiled the current W4A16 grouped MoE prefill path after the
@@ -1861,8 +1871,10 @@ revalidated it against sequential-disabled and serial-chunked baselines, and
Phase10/11/13 rejected the smaller follow-up GDN reorders. Phase41/43 closed
D1 on the current GB10 path unless a fresh route trace proves a host-sync
fallback returned. Phase60/61/66 rejected another small W4A16/direct-A or
quant/gather pass. Treat the list below as pre-Phase60 planning context, not an
active queue.
quant/gather pass. Phase72 rejected min32 as a broad serving default, and
Phase73 set the active queue to datacenter-Blackwell rerun readiness or a
standalone GDN blocked-solve PoC before source work. Treat the list below as
pre-Phase60 planning context, not an active queue.
Ranked, each with its pass-gate: