mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 12:57:02 -04:00
docs(paged): record datacenter Blackwell readiness phase
Assisted-by: Codex:gpt-5
@@ -12,10 +12,10 @@ with artifact path, gates, benchmark rows, and decision.
- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
- Current tested source: DGX mirror
  `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
- Latest attempt: Phase72.
- Latest decision: keep `LLAMA_TTFT_PREFILL_FIRST=1` plus
  `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` opt-in only. It regressed broad
  serving aggregate, decode, TTFT, and wall time at `n=8`, `n=32`, and `n=128`.
- Latest attempt: Phase73.
- Latest decision: no new GB10 benchmark or source patch. The next parity
  evidence requires a datacenter Blackwell rerun, or a standalone GDN
  blocked-solve PoC before any backend GDN source work.

## Current Serving Record

@@ -55,6 +55,39 @@ Decision:

## Attempt Log

### Phase73: Datacenter Blackwell Rerun Readiness

- Date: 2026-07-01.
- Plan:
  `docs/superpowers/plans/2026-07-01-datacenter-blackwell-rerun-readiness-phase73.md`.
- Artifact: no new benchmark artifact.
- Source baseline: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
- Result type: harness/spec audit only.

Evidence:

- Phase72 is the current GB10 serving baseline. The default llama/vLLM decode
  ratios remain `0.7561`, `0.7158`, and `0.6935` at `n=8/32/128`.
- Grouped-MMQ/W4A16: Phase61 direct activation was the last structurally
  distinct W4A16 shortcut; it failed its keep gate and stayed far behind
  default FP4-MMQ. Phase66 quantize plus gather was only `5.10%`, below the
  source-funding threshold.
- GDN: Phase71 kept shipped M5 as default. The remaining GDN gap is a larger
  FLA/CuteDSL-class C=64 blocked-solve/register-state implementation, not
  another C32/QS/global-Ai/local reorder.
- Harness: `paged-current-serving-snapshot.sh` already records
  `hardware_class=datacenter_blackwell` for B200/B100/GB200 and supports
  `DRY_RUN=1`, `SERVED_MODEL_NAME`, and vLLM deployment overrides.
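
The harness options in the last bullet allow a cheap pre-flight before any GPU time is spent. The sketch below wraps a `DRY_RUN=1` invocation with the serving shape this phase lists for the B200 rerun. The script location (`SNAPSHOT`) and the function name `run_snapshot_dry` are assumptions, what a dry run actually prints is up to the harness, and the B200-specific `VLLM_*` values are deliberately left to the caller's environment.

```sh
#!/bin/sh
# Hypothetical location of the harness; point SNAPSHOT at the real script.
SNAPSHOT=${SNAPSHOT:-./paged-current-serving-snapshot.sh}

# Run the harness in dry-run mode with the serving shape planned for the rerun.
# Any VLLM_* overrides already exported by the caller are passed through unchanged.
run_snapshot_dry() {
    env DRY_RUN=1 NPL="8 32 128" PTOK=128 GEN=64 PARALLEL=128 CTX=131072 \
        "$SNAPSHOT" "$@"
}

if [ -x "$SNAPSHOT" ]; then
    run_snapshot_dry
else
    echo "harness not found at $SNAPSHOT; set SNAPSHOT and retry" >&2
fi
```

Using `env` rather than a bare assignment prefix keeps the settings scoped to the child process, so a later real run in the same shell does not inherit `DRY_RUN=1` by accident.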

Decision:

- Do not start more GB10 grouped-MMQ/W4A16 source work.
- Do not start GDN backend source work until a standalone C=64 blocked-solve
  PoC records timing, numerical error, and resource estimates.
- The next parity run should be on datacenter Blackwell hardware with the
  existing same-session serving harness plus graph-node decode profiles.
- No parity claim is made by this phase.

### Phase72: TTFT Min32 Broader Serving

- Date: 2026-07-01.

@@ -1182,3 +1182,64 @@ Decision: keep `LLAMA_TTFT_PREFILL_FIRST=1` plus
`LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` opt-in only. It regressed aggregate,
decode, TTFT, and wall time at every tested concurrency in the broader shape,
and widened the vLLM decode gap. Do not default this scheduler policy on GB10.
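
Because the policy stays opt-in, an experiment has to switch it on explicitly and switch it off again afterwards. A minimal sketch, assuming both settings are environment variables read by the serving process (the text gives their names and values but not how they are consumed); the two helper names are illustrative.

```sh
#!/bin/sh
# Assumption: both knobs are environment variables read by the serving process.

# Opt in to prefill-first scheduling with the min-waiting threshold of 32.
ttft_min32_enable() {
    export LLAMA_TTFT_PREFILL_FIRST=1
    export LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32
}

# Return to the default policy by removing both settings from the environment.
ttft_min32_disable() {
    unset LLAMA_TTFT_PREFILL_FIRST LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING
}
```

Calling `ttft_min32_disable` before a baseline run guards against a stale opt-in leaking into numbers that are meant to represent the default scheduler.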

## 18. PHASE73 RESULT: DATACENTER BLACKWELL RERUN READINESS

Phase73 is a no-new-benchmark decision/spec phase. Plan:
`docs/superpowers/plans/2026-07-01-datacenter-blackwell-rerun-readiness-phase73.md`.
Benchmark ledger:
`backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md`.

No GPU benchmark was run and no llama.cpp source was changed. Source baseline
remains DGX mirror commit `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.

Decision:

- Do not start more GB10 grouped-MMQ/W4A16 source work. Phase61 direct-A was
  the last structurally distinct W4A16 shortcut and failed its keep gate; Phase66
  quantize plus gather was only `5.10%`, below the source-funding threshold.
- Do not start GDN backend source work until a standalone C=64 blocked-solve PoC
  proves timing and numerical viability. Phase71 kept M5 as shipped; the
  remaining GDN gap is a larger FLA/CuteDSL-class blocked-solve/register-state
  implementation, not another C32/QS/global-Ai/local reorder.
- The next parity evidence should come from datacenter Blackwell hardware with
  the existing same-session serving harness plus graph-node decode profiles.

B200 rerun checklist:

1. Build and verify the llama.cpp paged binary on B200 or equivalent
   datacenter Blackwell hardware with the correct CUDA architecture/settings.
2. Install and verify vLLM `0.23.0+` with the intended Blackwell backend stack.
3. Confirm both model forms exist: `q36-35b-a3b-nvfp4.gguf` and
   `q36-35b-a3b-nvfp4-vllm`.
4. Run `paged-current-serving-snapshot.sh` with `NPL="8 32 128"`, `PTOK=128`,
   `GEN=64`, `PARALLEL=128`, `CTX=131072`, and B200-specific
   `VLLM_GPU_MEMORY_UTILIZATION`, `VLLM_MAX_NUM_SEQS`, and
   `VLLM_TENSOR_PARALLEL_SIZE`.
5. Before interpreting the artifact, require `hardware.txt` to say
   `hardware_class=datacenter_blackwell`, `gate_summary.tsv` to be green, the
   pre/post MoE md5 to be `8cb0ce23`, the dense md5 to be `5951a5b4`, the
   `MUL_MAT` and `MUL_MAT_ID` op gates to be green, and `summary.tsv` to
   contain rows for both paged and vLLM.
6. Rerun the decode profiles under `nsys profile --cuda-graph-trace=node` and
   inspect whether vLLM is using native FP4/CUTLASS/FlashInfer rather than the
   GB10 Marlin fallback.
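
Part of step 5 can be automated so that an artifact from the wrong hardware class is rejected before anyone reads its numbers. The sketch below covers only two of the listed gates, and the file layouts are assumptions rather than the harness's documented format: `hardware.txt` is taken to contain the literal `hardware_class=...` string, and each `summary.tsv` row is taken to name its engine (`paged` or `vllm`) somewhere on the line. The function name `check_artifact` is illustrative.

```sh
#!/bin/sh
# check_artifact DIR: return 0 only if the hardware-class and summary-row gates pass.
# Assumed layouts (not confirmed by the harness): hardware.txt contains the literal
# "hardware_class=..." string; summary.tsv rows mention "paged" or "vllm".
check_artifact() {
    dir=$1
    grep -qF 'hardware_class=datacenter_blackwell' "$dir/hardware.txt" || {
        echo "FAIL: hardware_class is not datacenter_blackwell" >&2
        return 1
    }
    for engine in paged vllm; do
        grep -qi "$engine" "$dir/summary.tsv" || {
            echo "FAIL: summary.tsv has no $engine row" >&2
            return 1
        }
    done
    echo "OK: $dir passes the hardware-class and summary-row gates"
}
```

A missing file fails the same way as a wrong value, because `grep` returns non-zero in both cases. The md5 and op-gate checks are left out because the text does not say where the artifact records them.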

Standalone GDN source-work gate:

```sh
# Build the standalone PoC benchmark for the sm_121a target.
nvcc -O3 -arch=sm_121a \
  ~/scratch_tc_gdn_poc/gdn_blocked_solve_bench.cu \
  -o ~/scratch_tc_gdn_poc/gdn_blocked_solve_bench

# Run the C=64 blocked solve against the f64 oracle and dump the results as JSON.
mkdir -p ~/bench
~/scratch_tc_gdn_poc/gdn_blocked_solve_bench \
  --c 64 --dk 128 --dv 128 \
  --iters 1000 \
  --precision tf32,offdiag3x,apply3x \
  --oracle f64 \
  --dump-json ~/bench/phase73_gdn_blocked_solve_poc.json
```

Do not touch `ggml/src/ggml-cuda/gated_delta_net.cu` for this larger path until
that standalone artifact shows a material timing win, non-catastrophic
weak-decay and mixed-decay error, and a plausible register/shared-memory fit,
and records timing, precision-rung error, and the condition-number distribution.
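
The register/shared-memory fit named above can be estimated at compile time, before any timing run. A minimal sketch, assuming the PoC source path from the build command and using `-Xptxas -v`, which makes `ptxas` print per-kernel register, shared-memory, and spill figures. The function name and the scratch object path are illustrative and not part of the PoC.

```sh
#!/bin/sh
# PoC source path taken from the build command in this section.
SRC=${SRC:-$HOME/scratch_tc_gdn_poc/gdn_blocked_solve_bench.cu}

# Compile only (-c) and keep just the ptxas resource lines.
# The 2>&1 captures the report whichever stream the toolchain writes it to.
gdn_resource_report() {
    nvcc -O3 -arch=sm_121a -Xptxas -v -c "$1" \
        -o "${TMPDIR:-/tmp}/gdn_blocked_solve_probe.o" 2>&1 |
        grep -E 'Used [0-9]+ registers|spill|Compiling entry function'
}

if command -v nvcc >/dev/null 2>&1 && [ -f "$SRC" ]; then
    gdn_resource_report "$SRC"
else
    echo "nvcc or $SRC not available; skipping resource report" >&2
fi
```

Non-zero spill stores or loads in this report would be an early sign that the register-state design does not fit, independent of what the timing numbers later show.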
@@ -1586,6 +1586,16 @@ green, but min32 regressed every tested concurrency: aggregate ratios
ratios `1.0379`/`1.0977`/`1.0300` at `n=8/32/128`. Keep min32 opt-in only and
do not default it on GB10.

Phase73 made the post-Phase72 next-step decision. It ran no new benchmark and
changed no llama.cpp source. Grouped-MMQ/W4A16 GB10 source work is closed:
Phase61 direct-A was the last structurally distinct W4A16 shortcut and failed
its keep gate, and Phase66 quantize plus gather was only `5.10%`. GDN backend
source work is also gated: Phase71 kept M5 as shipped, and the remaining GDN
gap is a FLA/CuteDSL-class C=64 blocked-solve/register-state implementation,
not another local reorder. The next parity evidence should be a datacenter
Blackwell same-session rerun, or a standalone GDN blocked-solve PoC before any
backend GDN source work.

### Phase 60 current W4A16 prefill profile

Phase60 re-profiled the current W4A16 grouped MoE prefill path after the
@@ -1861,8 +1871,10 @@ revalidated it against sequential-disabled and serial-chunked baselines, and
Phase10/11/13 rejected the smaller follow-up GDN reorders. Phase41/43 closed
D1 on the current GB10 path unless a fresh route trace proves a host-sync
fallback returned. Phase60/61/66 rejected another small W4A16/direct-A or
quant/gather pass. Treat the list below as pre-Phase60 planning context, not an
active queue.
quant/gather pass. Phase72 rejected min32 as a broad serving default, and
Phase73 set the active queue to datacenter-Blackwell rerun readiness or a
standalone GDN blocked-solve PoC before source work. Treat the list below as
pre-Phase60 planning context, not an active queue.

Ranked, each with its pass-gate: