docs(paged): record post-PoC GDN audit phase

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 14:44:17 +00:00
parent 5369219729
commit 26a41fad1a
2 changed files with 78 additions and 6 deletions

@@ -12,12 +12,13 @@ with artifact path, gates, benchmark rows, and decision.
- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
- Current tested source: DGX mirror
`14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
- Latest attempt: Phase74.
- Latest decision: standalone C=64 GDN solve/apply PoC did not fund backend
source work. Explicit inverse-plus-apply was only `0.594x`/`0.593x` the
direct solve/apply baseline for weak/mixed decay, so the next parity evidence
should be a datacenter Blackwell rerun or a substantially different TC solve
PoC.
- Latest attempt: Phase75.
- Latest decision: subagent codebase audit found no source-funded GB10 GDN
backend change. Phase74 rejects the C=64 inverse scaffold; vLLM's distinct
one-token recurrent decode path is not on the current llama.cpp critical path
because prior profiles showed GDN decode already faster than vLLM and serving
decode host/MoE-sync bound. The next parity evidence should be a datacenter
Blackwell rerun, or a fresh profile proving a different GB10 bottleneck.
## Current Serving Record
@@ -57,6 +58,54 @@ Decision:
## Attempt Log
### Phase75: Post-PoC GDN/vLLM Audit
- Date: 2026-07-01.
- Artifact: no new benchmark artifact.
- Source baseline: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
- Result type: subagent codebase audit and gate-setting only; no source change.
- Inputs: Phase74 artifact
`/home/mudler/bench/phase74_gdn_blocked_solve_poc/20260701_143711`,
llama.cpp GDN implementation, vLLM FLA/GDN implementation, and parity docs.
Findings:
- llama.cpp already has the M5 tensor-core GDN path default-on under paged KV.
It includes `KK/QK` mma, `KS/QS` 3xtf32 mma, `P*U` mma, explicit
`T=A^-1`, `U=T*RHS`, and state carry `Kc^T*DU`.
- The current backend path is fixed at `C=16` for GB10 shared-memory limits.
The remaining C=64/register-state class is not a shortcut patch.
- Phase74 tested a C=64 shared-memory explicit inverse-plus-apply scaffold,
which failed its source-work gate: the inverse-to-direct speed ratio was
`0.5941x` for weak decay and `0.5927x` for mixed decay.
- vLLM has a structurally different one-token recurrent decode kernel that
updates state directly without chunk inverse, and a packed decode path that
avoids Q/K/V materialization copies. This is not currently source-funded in
llama.cpp because prior parity profiles showed llama.cpp GDN decode faster
than vLLM and decode serving dominated by host/MoE synchronization.
- vLLM's CuTeDSL GDN prefill path uses SM10x/CUDA-13 Blackwell features
including TMA/tcgen05/CUTLASS DSL. Treat it as datacenter-Blackwell reference
evidence unless GB10 support is proven in the local toolchain.
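
For orientation, the "updates state directly without chunk inverse" idea can be sketched with the textbook gated delta rule for a single token. This is a minimal pure-Python illustration, not vLLM's or llama.cpp's kernel: the function name `gdn_decode_step`, the `d_v x d_k` state layout, and the scalar `alpha`/`beta` gates are assumptions made for the example.

```python
def gdn_decode_step(state, q, k, v, alpha, beta):
    """One-token gated delta rule update (illustrative sketch only).

    state: d_v x d_k nested list, q and k: length d_k, v: length d_v.
    alpha is the decay gate and beta the write strength (assumed scalars).
    Computes S' = alpha*S, S_new = S' + beta*(v - S'k) k^T, out = S_new q.
    """
    d_v, d_k = len(state), len(state[0])
    # Decay the carried state.
    decayed = [[alpha * s for s in row] for row in state]
    # What the decayed state currently predicts for key k.
    pred = [sum(decayed[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # Rank-1 correction toward v; no chunk matrix is built or inverted.
    delta = [beta * (v[i] - pred[i]) for i in range(d_v)]
    new_state = [
        [decayed[i][j] + delta[i] * k[j] for j in range(d_k)]
        for i in range(d_v)
    ]
    out = [sum(new_state[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return new_state, out
```

Each decode step is a rank-1 update of the state, so there is no `T=A^-1` solve to amortize. That is why this path is structurally different from the chunked `C=16` prefill path described above.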
Decision:
- Do not start GB10 GDN backend source work after Phase74/75.
- Do not start a packed/recurrent GDN decode PoC unless a fresh same-session
profile shows GDN decode or Q/K/V materialization back on the critical path.
- The Phase75 acceptance gate for the next real parity attempt is a datacenter
Blackwell serving rerun with the Phase72 shape:
`NPL=8 32 128`, `PTOK=128`, `GEN=64`, `PARALLEL=128`, production defaults.
- The rerun is valid only if `hardware.txt` records
`hardware_class=datacenter_blackwell`, pre/post md5 gates are green
(`8cb0ce23777bf55f92f63d0292c756b0`,
`5951a5b4d624ce891e22ab5fca9bc439`), `MUL_MAT 1146/1146` and
`MUL_MAT_ID 806/806` are green, and decode profiles include
`nsys --cuda-graph-trace=node`.
- If datacenter Blackwell materially lifts llama.cpp/vLLM decode ratios above the
GB10 Phase72 record (`0.7561`, `0.7158`, `0.6935`), continue parity work on
that surface. If not, record the residual gap as engine/kernel architecture
rather than GB10 memory bandwidth and keep GB10 GDN stopped.
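
The validity conditions above could be checked mechanically before a rerun is accepted. The following is a hedged sketch, not an existing script in this repository: `RerunRecord`, its field names, and `gate_failures` are hypothetical. It also assumes that both listed md5 values must match before and after the run, and that every decode profile command line is available as a string.

```python
from dataclasses import dataclass

# Values copied from the gate text; everything else here is illustrative.
EXPECTED_MD5 = (
    "8cb0ce23777bf55f92f63d0292c756b0",
    "5951a5b4d624ce891e22ab5fca9bc439",
)
EXPECTED_OPS = {"MUL_MAT": 1146, "MUL_MAT_ID": 806}
REQUIRED_NSYS_FLAG = "--cuda-graph-trace=node"


@dataclass
class RerunRecord:
    """Hypothetical summary of one serving rerun artifact."""
    hardware_class: str   # value recorded in hardware.txt
    md5_pre: tuple        # md5 values observed before the run
    md5_post: tuple       # md5 values observed after the run
    ops_passed: dict      # op name -> number of passing cases
    nsys_cmdlines: list   # one nsys command line per decode profile


def gate_failures(rec):
    """Return the names of failed gate conditions (empty list means valid)."""
    failures = []
    if rec.hardware_class != "datacenter_blackwell":
        failures.append("hardware_class")
    if tuple(rec.md5_pre) != EXPECTED_MD5:
        failures.append("md5_pre")
    if tuple(rec.md5_post) != EXPECTED_MD5:
        failures.append("md5_post")
    for op, total in EXPECTED_OPS.items():
        if rec.ops_passed.get(op) != total:
            failures.append(op)
    profiles_ok = bool(rec.nsys_cmdlines) and all(
        REQUIRED_NSYS_FLAG in cmd for cmd in rec.nsys_cmdlines
    )
    if not profiles_ok:
        failures.append("nsys_graph_trace")
    return failures
```

Returning the list of failed conditions, rather than a single boolean, keeps the attempt log able to state exactly which gate rejected a rerun.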
### Phase74: GDN Blocked-Solve PoC Gate
- Date: 2026-07-01.

@@ -1256,3 +1256,26 @@ Decision: do not touch `ggml/src/ggml-cuda/gated_delta_net.cu` for this C=64
inverse scaffold on GB10. A future GDN source-work gate must be a substantially
different tensor-core blocked-solve/register-state design that shows a material
timing win before backend changes.
Phase75 follow-up audit:
- llama.cpp already ships the M5 tensor-core GDN path default-on under paged KV:
`KK/QK`, `KS/QS`, `P*U`, explicit `T=A^-1`, `U=T*RHS`, and
`Kc^T*DU` state carry are covered in the current `C=16` GB10 path.
- vLLM has a distinct one-token recurrent decode path that updates state
directly and a packed decode path that avoids Q/K/V materialization copies,
but this is not source-funded in llama.cpp without a fresh profile: prior
parity evidence showed llama.cpp GDN decode already faster than vLLM and
decode serving dominated by host/MoE synchronization.
- vLLM's CuTeDSL GDN prefill path is useful reference material for datacenter
Blackwell, but depends on SM10x/CUDA-13 features such as TMA/tcgen05/CUTLASS
DSL and should not be treated as a portable GB10 patch base until the local
toolchain proves support.
Current next gate: run the Phase72 same-session serving shape on B200/B100/GB200
or equivalent datacenter Blackwell hardware. Require `hardware_class` to be
`datacenter_blackwell`, pre/post md5 and op gates to be green, and graph-node
decode profiles for both llama.cpp and vLLM. If the rerun does not materially
raise the GB10 Phase72 decode ratios (`0.7561`, `0.7158`, `0.6935`), keep GB10
GDN source work stopped and classify the residual gap as engine/kernel
architecture rather than GB10 memory bandwidth.
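
The final branch of this gate can be written down as a small decision helper. This is only a sketch under stated assumptions: the text does not define "materially", so `min_lift` is a hypothetical threshold the caller must choose. The three ratios are assumed to be ordered like the Phase72 `NPL` values, and the sketch assumes all three must improve. The function name and return labels are invented for illustration.

```python
# GB10 Phase72 llama.cpp/vLLM decode ratios, copied from the text above.
GB10_PHASE72_RATIOS = (0.7561, 0.7158, 0.6935)


def classify_rerun(new_ratios, min_lift):
    """Classify a datacenter Blackwell rerun against the GB10 record.

    new_ratios: decode ratios from the rerun, same order as the GB10 record.
    min_lift:   hypothetical absolute improvement counted as "material".
    """
    if len(new_ratios) != len(GB10_PHASE72_RATIOS):
        raise ValueError("expected one ratio per Phase72 shape")
    lifted = all(
        new >= old + min_lift
        for new, old in zip(new_ratios, GB10_PHASE72_RATIOS)
    )
    if lifted:
        return "continue_parity_on_datacenter_blackwell"
    return "engine_kernel_architecture_gap"
```

For example, `classify_rerun((0.90, 0.88, 0.85), 0.05)` selects the continue branch, while a rerun that reproduces the GB10 ratios falls into the architecture-gap branch and leaves GB10 GDN source work stopped.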