mirror of https://github.com/mudler/LocalAI.git (synced 2026-07-03 04:46:54 -04:00)
docs(paged): record post-PoC GDN audit phase
Assisted-by: Codex:gpt-5
@@ -12,12 +12,13 @@ with artifact path, gates, benchmark rows, and decision.
 - Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
 - Current tested source: DGX mirror
 `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
-- Latest attempt: Phase74.
-- Latest decision: standalone C=64 GDN solve/apply PoC did not fund backend
-source work. Explicit inverse-plus-apply was only `0.594x`/`0.593x` the
-direct solve/apply baseline for weak/mixed decay, so the next parity evidence
-should be a datacenter Blackwell rerun or a substantially different TC solve
-PoC.
+- Latest attempt: Phase75.
+- Latest decision: subagent codebase audit found no source-funded GB10 GDN
+backend change. Phase74 rejects the C=64 inverse scaffold; vLLM's distinct
+one-token recurrent decode path is not on the current llama.cpp critical path
+because prior profiles showed GDN decode already faster than vLLM and serving
+decode host/MoE-sync bound. The next parity evidence should be a datacenter
+Blackwell rerun, or a fresh profile proving a different GB10 bottleneck.
 
 ## Current Serving Record
 
@@ -57,6 +58,54 @@ Decision:
 
 ## Attempt Log
 
+### Phase75: Post-PoC GDN/VLLM Audit
+
+- Date: 2026-07-01.
+- Artifact: no new benchmark artifact.
+- Source baseline: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
+- Result type: subagent codebase audit and gate-setting only; no source change.
+- Inputs: Phase74 artifact
+`/home/mudler/bench/phase74_gdn_blocked_solve_poc/20260701_143711`,
+llama.cpp GDN implementation, vLLM FLA/GDN implementation, and parity docs.
+
+Findings:
+
+- llama.cpp already has the M5 tensor-core GDN path default-on under paged KV.
+It includes `KK/QK` mma, `KS/QS` 3xtf32 mma, `P*U` mma, explicit
+`T=A^-1`, `U=T*RHS`, and state carry `Kc^T*DU`.
+- The current backend path is fixed at `C=16` for GB10 shared-memory limits.
+The remaining C=64/register-state class is not a shortcut patch.
+- Phase74 tested a C=64 shared-memory explicit inverse-plus-apply scaffold and
+failed its source-work gate: inverse/direct speed was `0.5941x` weak decay
+and `0.5927x` mixed decay.
+- vLLM has a structurally different one-token recurrent decode kernel that
+updates state directly without chunk inverse, and a packed decode path that
+avoids Q/K/V materialization copies. This is not currently source-funded in
+llama.cpp because prior parity profiles showed llama.cpp GDN decode faster
+than vLLM and decode serving dominated by host/MoE synchronization.
+- vLLM's CuTeDSL GDN prefill path uses SM10x/CUDA-13 Blackwell features
+including TMA/tcgen05/CUTLASS DSL. Treat it as datacenter-Blackwell reference
+evidence unless GB10 support is proven in the local toolchain.
+
+Decision:
+
+- Do not start GB10 GDN backend source work after Phase74/75.
+- Do not start a packed/recurrent GDN decode PoC unless a fresh same-session
+profile shows GDN decode or Q/K/V materialization back on the critical path.
+- Phase75 acceptance gate for the next real parity attempt is a datacenter
+Blackwell serving rerun with the Phase72 shape:
+`NPL=8 32 128`, `PTOK=128`, `GEN=64`, `PARALLEL=128`, production defaults.
+- The rerun is valid only if `hardware.txt` records
+`hardware_class=datacenter_blackwell`, pre/post md5 gates are green
+(`8cb0ce23777bf55f92f63d0292c756b0`,
+`5951a5b4d624ce891e22ab5fca9bc439`), `MUL_MAT 1146/1146` and
+`MUL_MAT_ID 806/806` are green, and decode profiles include
+`nsys --cuda-graph-trace=node`.
+- If datacenter Blackwell materially lifts llama/vLLM decode ratios above the
+GB10 Phase72 record (`0.7561`, `0.7158`, `0.6935`), continue parity work on
+that surface. If not, record the residual gap as engine/kernel architecture
+rather than GB10 memory bandwidth and keep GB10 GDN stopped.
+
 ### Phase74: GDN Blocked-Solve PoC Gate
 
 - Date: 2026-07-01.
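The Phase75 findings in the hunk above contrast an explicit `T=A^-1` followed by `U=T*RHS` with the direct solve/apply baseline that Phase74 benchmarked against. The following is a minimal CPU sketch of why the two routes are interchangeable numerically. It assumes `A` is a unit lower-triangular `C x C` matrix, which is common in chunked delta-rule formulations but is not stated in this log. All helper names and the toy matrices are hypothetical. It makes no claim about GPU timing and is not the llama.cpp kernel.

```python
# Sketch only: pure-Python comparison of "explicit inverse then apply"
# against a direct triangular solve. A is ASSUMED unit lower-triangular.

def forward_substitute(A, B):
    """Direct solve of A @ X = B by forward substitution (unit diagonal)."""
    n, m = len(A), len(B[0])
    X = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = B[i][j]
            for k in range(i):
                acc -= A[i][k] * X[k][j]
            X[i][j] = acc  # diagonal entry is 1, so no division is needed
    return X

def explicit_inverse(A):
    """T = A^-1, obtained by solving against the identity matrix."""
    n = len(A)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    return forward_substitute(A, identity)

def matmul(P, Q):
    """Plain dense matrix product P @ Q."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def max_abs_diff(P, Q):
    return max(abs(p - q) for rp, rq in zip(P, Q) for p, q in zip(rp, rq))

def make_unit_lower(n):
    """Hypothetical toy matrix: ones on the diagonal, small values below it."""
    return [[1.0 if i == j else
             (0.01 * ((i * 7 + j * 3) % 5 - 2) if j < i else 0.0)
             for j in range(n)] for i in range(n)]

def make_rhs(n, m):
    """Hypothetical toy right-hand side."""
    return [[float((i + 2 * j) % 7) - 3.0 for j in range(m)] for i in range(n)]

C = 16                       # chunk size of the current backend path
A = make_unit_lower(C)
RHS = make_rhs(C, 4)

U_direct = forward_substitute(A, RHS)   # direct solve/apply
T = explicit_inverse(A)                 # T = A^-1
U_inverse = matmul(T, RHS)              # U = T * RHS
```

Both routes yield the same `U` up to rounding, so choosing between them is purely a performance question, which is what the Phase74 gate measured.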
@@ -1256,3 +1256,26 @@ Decision: do not touch `ggml/src/ggml-cuda/gated_delta_net.cu` for this C=64
 inverse scaffold on GB10. A future GDN source-work gate must be a substantially
 different tensor-core blocked-solve/register-state design that shows a material
 timing win before backend changes.
+
+Phase75 follow-up audit:
+
+- llama.cpp already ships the M5 tensor-core GDN path default-on under paged KV:
+`KK/QK`, `KS/QS`, `P*U`, explicit `T=A^-1`, `U=T*RHS`, and
+`Kc^T*DU` state carry are covered in the current `C=16` GB10 path.
+- vLLM has a distinct one-token recurrent decode path that updates state
+directly and a packed decode path that avoids Q/K/V materialization copies,
+but this is not source-funded in llama.cpp without a fresh profile: prior
+parity evidence showed llama.cpp GDN decode already faster than vLLM and
+decode serving dominated by host/MoE synchronization.
+- vLLM's CuTeDSL GDN prefill path is useful reference material for datacenter
+Blackwell, but depends on SM10x/CUDA-13 features such as TMA/tcgen05/CUTLASS
+DSL and should not be treated as a portable GB10 patch base until the local
+toolchain proves support.
+
+Current next gate: run the Phase72 same-session serving shape on B200/B100/GB200
+or equivalent datacenter Blackwell hardware. Require `hardware_class` to be
+`datacenter_blackwell`, pre/post md5 and op gates to be green, and graph-node
+decode profiles for both llama.cpp and vLLM. If the rerun does not materially
+raise the GB10 Phase72 decode ratios (`0.7561`, `0.7158`, `0.6935`), keep GB10
+GDN source work stopped and classify the residual gap as engine/kernel
+architecture rather than GB10 memory bandwidth.
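The validity conditions this commit records for the datacenter Blackwell rerun could be checked mechanically. The sketch below is one possible shape for such a check, not a script from the repository. The record layout and the key names (`hardware_txt`, `pre_md5s`, `post_md5s`, `op_results`, `nsys_args`) are hypothetical. It also assumes that both md5 digests are verified before and after the run, since the log does not spell out the pairing. Only the digest values, the op counts, `hardware_class=datacenter_blackwell`, and the `--cuda-graph-trace=node` flag come from the text.

```python
# Sketch only: hypothetical validator for the Phase75 rerun gate.

EXPECTED_MD5S = {
    "8cb0ce23777bf55f92f63d0292c756b0",
    "5951a5b4d624ce891e22ab5fca9bc439",
}
EXPECTED_OP_TOTALS = {"MUL_MAT": 1146, "MUL_MAT_ID": 806}
NSYS_FLAG = "--cuda-graph-trace=node"

def parse_hardware_txt(text):
    """Parse key=value lines; the hardware.txt layout is an assumption."""
    fields = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields

def gate_failures(record):
    """Return the names of failed gates; an empty list means the rerun is valid."""
    failures = []
    hardware = parse_hardware_txt(record["hardware_txt"])
    if hardware.get("hardware_class") != "datacenter_blackwell":
        failures.append("hardware_class")
    for stage in ("pre_md5s", "post_md5s"):
        if set(record[stage]) != EXPECTED_MD5S:
            failures.append(stage)
    for op, total in EXPECTED_OP_TOTALS.items():
        # Each op result is a (passed, total) pair, e.g. (1146, 1146).
        if record["op_results"].get(op) != (total, total):
            failures.append(op)
    for engine in ("llama.cpp", "vllm"):
        if NSYS_FLAG not in record["nsys_args"].get(engine, []):
            failures.append("nsys:" + engine)
    return failures

# Hypothetical example record for a rerun that passes every gate.
example_record = {
    "hardware_txt": "hardware_class=datacenter_blackwell\n",
    "pre_md5s": sorted(EXPECTED_MD5S),
    "post_md5s": sorted(EXPECTED_MD5S),
    "op_results": {"MUL_MAT": (1146, 1146), "MUL_MAT_ID": (806, 806)},
    "nsys_args": {
        "llama.cpp": ["profile", NSYS_FLAG],
        "vllm": ["profile", NSYS_FLAG],
    },
}
```

Returning the list of failed gates rather than a single boolean keeps the attempt log specific about which condition invalidated a rerun.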
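The final decision rule compares rerun decode ratios against the GB10 Phase72 record. The log does not define "materially", so the sketch below uses a hypothetical `min_lift` threshold and assumes every shape must clear it. It also assumes the three ratios are listed in the same order in both records. Treat it as an illustration of the rule, not the project's actual criterion.

```python
# Sketch only: hypothetical classifier for the post-rerun decision.

PHASE72_GB10_RATIOS = (0.7561, 0.7158, 0.6935)  # llama/vLLM decode ratios from the log

def classify_rerun(rerun_ratios, min_lift=0.05):
    """Compare rerun ratios with Phase72; min_lift is an ASSUMED threshold."""
    if len(rerun_ratios) != len(PHASE72_GB10_RATIOS):
        raise ValueError("expected one ratio per Phase72 shape")
    lifts = [new - old for new, old in zip(rerun_ratios, PHASE72_GB10_RATIOS)]
    if all(lift >= min_lift for lift in lifts):
        return "continue parity work on datacenter Blackwell"
    return "engine/kernel architecture gap; keep GB10 GDN stopped"

# Hypothetical rerun results, for illustration only.
improved = classify_rerun((0.90, 0.88, 0.85))
unchanged = classify_rerun((0.76, 0.72, 0.70))
```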