Mirror of https://github.com/mudler/LocalAI.git, synced 2026-07-03 04:46:54 -04:00.
docs(paged): record GDN blocked-solve PoC phase
Assisted-by: Codex:gpt-5
@@ -12,10 +12,12 @@ with artifact path, gates, benchmark rows, and decision.
 - Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
 - Current tested source: DGX mirror
 `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
-- Latest attempt: Phase73.
-- Latest decision: no new GB10 benchmark or source patch. The next parity
-evidence requires a datacenter Blackwell rerun, or a standalone GDN
-blocked-solve PoC before any backend GDN source work.
+- Latest attempt: Phase74.
+- Latest decision: standalone C=64 GDN solve/apply PoC did not fund backend
+source work. Explicit inverse-plus-apply was only `0.594x`/`0.593x` the
+direct solve/apply baseline for weak/mixed decay, so the next parity evidence
+should be a datacenter Blackwell rerun or a substantially different TC solve
+PoC.
 
 ## Current Serving Record
 
@@ -55,6 +57,38 @@ Decision:
 
 ## Attempt Log
 
+### Phase74: GDN Blocked-Solve PoC Gate
+
+- Date: 2026-07-01.
+- Plan:
+`docs/superpowers/plans/2026-07-01-gdn-blocked-solve-poc-phase74.md`.
+- Artifact:
+`/home/mudler/bench/phase74_gdn_blocked_solve_poc/20260701_143711`.
+- Source baseline: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
+- Result type: standalone CUDA microbenchmark only; no llama.cpp source change.
+- Toolchain: CUDA `13.0.88`, `nvcc -O3 -arch=sm_121a`.
+- Hardware: NVIDIA GB10, `cc=12.1`, `48` SMs, `99 KB` dynamic shared memory.
+- Shape: `C=64`, `DK=128`, `DV=128`, `chunks=4096`, `iters=1000`.
+- Shared memory: direct solve/apply `81920` bytes; inverse-plus-apply
+`98304` bytes.
+
+Result:
+
+| case | direct ms | inverse+apply ms | inverse/direct speed | direct NMSE | inverse NMSE | direct max abs | inverse max abs | max lower row sum |
+|------|----------:|-----------------:|---------------------:|------------:|-------------:|---------------:|----------------:|------------------:|
+| weak decay | `3.263936` | `5.493515` | `0.5941x` | `2.081e-14` | `2.755e-15` | `8.890e-07` | `2.415e-07` | `4.072` |
+| mixed decay | `3.275959` | `5.527584` | `0.5927x` | `1.981e-14` | `7.541e-16` | `8.115e-07` | `7.888e-08` | `1.635` |
+
+Decision:
+
+- Reject this explicit inverse-plus-apply shape as a backend source candidate on
+GB10. It is numerically clean but materially slower than direct solve/apply.
+- Do not touch `ggml/src/ggml-cuda/gated_delta_net.cu` for the larger C=64 path
+based on this attempt.
+- A future GDN source-work gate would need a substantially different
+tensor-core blocked solve/register-state design, not this shared-memory
+inverse scaffold.
+
 ### Phase73: Datacenter Blackwell Rerun Readiness
 
 - Date: 2026-07-01.
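The `inverse/direct speed` column in the Phase74 table can be re-derived from the two timing columns. The following is a minimal Python sketch, assuming the column is the direct time divided by the inverse-plus-apply time (so a value below `1.0x` means inverse-plus-apply is slower). The names `ROWS`, `speed_ratio`, and `format_ratios` are illustrative and not part of the PoC.

```python
# Timings copied from the Phase74 result table, in milliseconds:
# case -> (direct ms, inverse+apply ms)
ROWS = {
    "weak decay": (3.263936, 5.493515),
    "mixed decay": (3.275959, 5.527584),
}


def speed_ratio(direct_ms: float, inverse_ms: float) -> float:
    """Assumed definition of the table column: direct time / inverse+apply time."""
    return direct_ms / inverse_ms


def format_ratios(rows=ROWS) -> dict:
    """Return each case's ratio formatted like the table, e.g. '0.5941x'."""
    return {
        case: f"{speed_ratio(direct, inverse):.4f}x"
        for case, (direct, inverse) in rows.items()
    }


if __name__ == "__main__":
    for case, ratio in format_ratios().items():
        print(f"{case}: {ratio}")
```

Under that assumed definition the sketch reproduces the `0.5941x` and `0.5927x` entries, which matches the reading that inverse-plus-apply takes roughly 1.68 times as long as direct solve/apply for both decay cases.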
@@ -1224,7 +1224,7 @@ B200 rerun checklist:
 whether vLLM is using native FP4/CUTLASS/FlashInfer rather than the GB10
 Marlin fallback.
 
-Standalone GDN source-work gate:
+Phase74 standalone GDN source-work gate result:
 
 ```sh
 nvcc -O3 -arch=sm_121a \
@@ -1236,10 +1236,23 @@ nvcc -O3 -arch=sm_121a \
 --iters 1000 \
 --precision tf32,offdiag3x,apply3x \
 --oracle f64 \
---dump-json ~/bench/phase73_gdn_blocked_solve_poc.json
+--dump-json ~/bench/phase74_gdn_blocked_solve_poc/20260701_143711/phase74_gdn_blocked_solve_poc.json
 ```
 
-Do not touch `ggml/src/ggml-cuda/gated_delta_net.cu` for this larger path until
-that standalone artifact shows a material timing win, non-catastrophic weak and
-mixed decay error, plausible register/shared-memory fit, and records timing,
-precision-rung error, and condition-number distribution.
+Artifact:
+`/home/mudler/bench/phase74_gdn_blocked_solve_poc/20260701_143711`.
+
+The standalone C=64 shared-memory explicit inverse-plus-apply scaffold did not
+fund backend source work:
+
+- weak decay: direct solve/apply `3.263936 ms`; inverse-plus-apply
+`5.493515 ms`; inverse/direct speed `0.5941x`; inverse NMSE `2.755e-15`;
+- mixed decay: direct solve/apply `3.275959 ms`; inverse-plus-apply
+`5.527584 ms`; inverse/direct speed `0.5927x`; inverse NMSE `7.541e-16`;
+- shared memory was already near the GB10 cap: direct `81920` bytes,
+inverse-plus-apply `98304` bytes, with `99 KB` opt-in available.
+
+Decision: do not touch `ggml/src/ggml-cuda/gated_delta_net.cu` for this C=64
+inverse scaffold on GB10. A future GDN source-work gate must be a substantially
+different tensor-core blocked-solve/register-state design that shows a material
+timing win before backend changes.
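The "near the GB10 cap" remark about shared memory can be made concrete with a small headroom calculation. This is a sketch only, assuming the recorded `99 KB` means 99 × 1024 bytes. The names `SMEM_CAP_BYTES`, `KERNEL_SMEM_BYTES`, and `headroom_report` are hypothetical and do not come from the PoC source.

```python
KIB = 1024

# Assumption: the "99 KB" opt-in limit in the log is 99 KiB = 101376 bytes.
SMEM_CAP_BYTES = 99 * KIB

# Dynamic shared-memory sizes recorded in the Phase74 entry, in bytes.
KERNEL_SMEM_BYTES = {
    "direct solve/apply": 81920,
    "inverse-plus-apply": 98304,
}


def headroom_report(sizes=KERNEL_SMEM_BYTES, cap=SMEM_CAP_BYTES) -> dict:
    """Map each kernel to (size in KiB, bytes left under the assumed cap)."""
    return {name: (used // KIB, cap - used) for name, used in sizes.items()}


if __name__ == "__main__":
    for name, (kib, left) in headroom_report().items():
        print(f"{name}: {kib} KiB used, {left} bytes of headroom")
```

Under that assumption the direct kernel uses 80 KiB and the inverse-plus-apply kernel 96 KiB, leaving 19456 and 3072 bytes of headroom respectively. The inverse scaffold therefore has almost no room left for additional shared-memory staging on GB10.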