docs(paged): record GDN blocked-solve PoC phase

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 14:39:09 +00:00
parent eb82ff138f
commit 5369219729
2 changed files with 57 additions and 10 deletions


@@ -12,10 +12,12 @@ with artifact path, gates, benchmark rows, and decision.
- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
- Current tested source: DGX mirror
`14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
- Latest attempt: Phase73.
- Latest decision: no new GB10 benchmark or source patch. The next parity
evidence requires a datacenter Blackwell rerun, or a standalone GDN
blocked-solve PoC before any backend GDN source work.
- Latest attempt: Phase74.
- Latest decision: the standalone C=64 GDN solve/apply PoC did not justify
backend source work. Explicit inverse-plus-apply ran at only `0.594x`/`0.593x`
the speed of the direct solve/apply baseline for weak/mixed decay, so the next
parity evidence should be a datacenter Blackwell rerun or a substantially
different tensor-core (TC) solve PoC.
## Current Serving Record
@@ -55,6 +57,38 @@ Decision:
## Attempt Log
### Phase74: GDN Blocked-Solve PoC Gate
- Date: 2026-07-01.
- Plan:
`docs/superpowers/plans/2026-07-01-gdn-blocked-solve-poc-phase74.md`.
- Artifact:
`/home/mudler/bench/phase74_gdn_blocked_solve_poc/20260701_143711`.
- Source baseline: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
- Result type: standalone CUDA microbenchmark only; no llama.cpp source change.
- Toolchain: CUDA `13.0.88`, `nvcc -O3 -arch=sm_121a`.
- Hardware: NVIDIA GB10, `cc=12.1`, `48` SMs, `99 KB` dynamic shared memory.
- Shape: `C=64`, `DK=128`, `DV=128`, `chunks=4096`, `iters=1000`.
- Shared memory: direct solve/apply `81920` bytes; inverse-plus-apply
`98304` bytes.
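The shared-memory figures above can be checked against the GB10 limit with a few lines of arithmetic. This is a minimal sketch, assuming that `99 KB` means `99 * 1024` bytes of opt-in dynamic shared memory; the helper name `headroom` is hypothetical and not part of the PoC.

```python
# Shared-memory headroom for the two Phase74 kernels on GB10.
# Assumption: "99 KB" is 99 * 1024 bytes of opt-in dynamic shared memory.
SMEM_CAP_BYTES = 99 * 1024

# Byte counts copied from the Phase74 record.
kernels = {
    "direct solve/apply": 81920,
    "inverse-plus-apply": 98304,
}


def headroom(used_bytes, cap_bytes=SMEM_CAP_BYTES):
    """Return (free bytes, fraction of the cap in use)."""
    return cap_bytes - used_bytes, used_bytes / cap_bytes


for name, used in kernels.items():
    free, frac = headroom(used)
    print(f"{name}: {used} B used, {free} B free, {frac:.1%} of cap")
```

Under that assumption the direct kernel leaves `19456` bytes free and the inverse-plus-apply kernel leaves `3072` bytes, so the inverse variant has almost no room for additional staging buffers.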
Result:
| case | direct ms | inverse+apply ms | inverse/direct speed | direct NMSE | inverse NMSE | direct max abs | inverse max abs | max lower row sum |
|------|----------:|-----------------:|---------------------:|------------:|-------------:|---------------:|----------------:|------------------:|
| weak decay | `3.263936` | `5.493515` | `0.5941x` | `2.081e-14` | `2.755e-15` | `8.890e-07` | `2.415e-07` | `4.072` |
| mixed decay | `3.275959` | `5.527584` | `0.5927x` | `1.981e-14` | `7.541e-16` | `8.115e-07` | `7.888e-08` | `1.635` |
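The `inverse/direct speed` column follows from the two timing columns: speed is inversely proportional to time, so the ratio is direct milliseconds divided by inverse-plus-apply milliseconds. A short sketch that recomputes it from the table values (the function name `relative_speed` is illustrative only):

```python
# Recompute the "inverse/direct speed" column from the timing columns.
# Timings (ms) are copied from the Phase74 result table.
rows = {
    "weak decay": (3.263936, 5.493515),
    "mixed decay": (3.275959, 5.527584),
}


def relative_speed(direct_ms, inverse_ms):
    """Speed of inverse+apply relative to direct (below 1.0 means slower)."""
    return direct_ms / inverse_ms


for case, (direct_ms, inverse_ms) in rows.items():
    print(f"{case}: {relative_speed(direct_ms, inverse_ms):.4f}x")
```

A value below `1.0x` means the inverse-plus-apply kernel is slower than the direct baseline, which is the case for both rows.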
Decision:
- Reject this explicit inverse-plus-apply shape as a backend source candidate on
GB10. It is numerically clean (inverse NMSE at most `2.755e-15`) but takes
roughly 1.7x the runtime of direct solve/apply in both decay cases.
- Do not touch `ggml/src/ggml-cuda/gated_delta_net.cu` for the larger C=64 path
based on this attempt.
- A future GDN source-work gate would need a substantially different
tensor-core blocked solve/register-state design, not this shared-memory
inverse scaffold.
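For readers unfamiliar with the two strategies compared here, the following pure-Python sketch contrasts them on a toy 3x3 system. It assumes the per-chunk system has the unit lower-triangular form `(I + L) X = B` with `L` strictly lower triangular; that form, the matrix values, and all function names are illustrative assumptions, not the PoC's CUDA code.

```python
# Toy comparison: direct forward substitution vs explicit inverse then apply.
# Assumption: the system is (I + L) X = B with L strictly lower triangular.

def forward_solve(L, B):
    """Solve (I + L) X = B row by row; only entries L[i][j] with j < i are read."""
    X = [row[:] for row in B]
    for i in range(len(B)):
        for j in range(i):
            for c in range(len(B[0])):
                X[i][c] -= L[i][j] * X[j][c]
    return X


def explicit_inverse(L):
    """Form (I + L)^-1 by solving against the identity matrix."""
    n = len(L)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    return forward_solve(L, identity)


def matmul(A, B):
    """Plain dense matrix product A @ B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


L = [[0.0, 0.0, 0.0],
     [0.5, 0.0, 0.0],
     [0.25, -0.5, 0.0]]
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

X_direct = forward_solve(L, B)                # one triangular solve
X_inverse = matmul(explicit_inverse(L), B)    # inverse first, then a matmul
print(X_direct)
print(X_inverse)
```

Both paths give the same answer on this example. The explicit-inverse path does the extra work of materializing the inverse and needs storage for it, which is consistent with the larger shared-memory footprint recorded for the inverse-plus-apply kernel.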
### Phase73: Datacenter Blackwell Rerun Readiness
- Date: 2026-07-01.


@@ -1224,7 +1224,7 @@ B200 rerun checklist:
whether vLLM is using native FP4/CUTLASS/FlashInfer rather than the GB10
Marlin fallback.
Standalone GDN source-work gate:
Phase74 standalone GDN source-work gate result:
```sh
nvcc -O3 -arch=sm_121a \
@@ -1236,10 +1236,23 @@ nvcc -O3 -arch=sm_121a \
--iters 1000 \
--precision tf32,offdiag3x,apply3x \
--oracle f64 \
--dump-json ~/bench/phase73_gdn_blocked_solve_poc.json
--dump-json ~/bench/phase74_gdn_blocked_solve_poc/20260701_143711/phase74_gdn_blocked_solve_poc.json
```
Do not touch `ggml/src/ggml-cuda/gated_delta_net.cu` for this larger path until
that standalone artifact shows a material timing win, non-catastrophic weak and
mixed decay error, plausible register/shared-memory fit, and records timing,
precision-rung error, and condition-number distribution.
Artifact:
`/home/mudler/bench/phase74_gdn_blocked_solve_poc/20260701_143711`.
The standalone C=64 shared-memory explicit inverse-plus-apply scaffold did not
justify backend source work:
- weak decay: direct solve/apply `3.263936 ms`; inverse-plus-apply
`5.493515 ms`; inverse/direct speed `0.5941x`; inverse NMSE `2.755e-15`;
- mixed decay: direct solve/apply `3.275959 ms`; inverse-plus-apply
`5.527584 ms`; inverse/direct speed `0.5927x`; inverse NMSE `7.541e-16`;
- shared memory was already near the GB10 cap: direct `81920` bytes (80 KiB),
inverse-plus-apply `98304` bytes (96 KiB), against `99 KB` of opt-in dynamic
shared memory.
Decision: do not touch `ggml/src/ggml-cuda/gated_delta_net.cu` for this C=64
inverse scaffold on GB10. A future GDN source-work gate must be a substantially
different tensor-core blocked-solve/register-state design that shows a material
timing win before backend changes.