diff --git a/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md b/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md
index 4f9fab52b..7ab380fd2 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md
@@ -12,10 +12,12 @@ with artifact path, gates, benchmark rows, and decision.
 
 - Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
 - Current tested source: DGX mirror `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
-- Latest attempt: Phase73.
-- Latest decision: no new GB10 benchmark or source patch. The next parity
-  evidence requires a datacenter Blackwell rerun, or a standalone GDN
-  blocked-solve PoC before any backend GDN source work.
+- Latest attempt: Phase74.
+- Latest decision: standalone C=64 GDN solve/apply PoC did not fund backend
+  source work. Explicit inverse-plus-apply was only `0.594x`/`0.593x` the
+  direct solve/apply baseline for weak/mixed decay, so the next parity evidence
+  should be a datacenter Blackwell rerun or a substantially different TC solve
+  PoC.
 
 ## Current Serving Record
 
@@ -55,6 +57,38 @@ Decision:
 
 ## Attempt Log
 
+### Phase74: GDN Blocked-Solve PoC Gate
+
+- Date: 2026-07-01.
+- Plan:
+  `docs/superpowers/plans/2026-07-01-gdn-blocked-solve-poc-phase74.md`.
+- Artifact:
+  `/home/mudler/bench/phase74_gdn_blocked_solve_poc/20260701_143711`.
+- Source baseline: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
+- Result type: standalone CUDA microbenchmark only; no llama.cpp source change.
+- Toolchain: CUDA `13.0.88`, `nvcc -O3 -arch=sm_121a`.
+- Hardware: NVIDIA GB10, `cc=12.1`, `48` SMs, `99 KB` dynamic shared memory.
+- Shape: `C=64`, `DK=128`, `DV=128`, `chunks=4096`, `iters=1000`.
+- Shared memory: direct solve/apply `81920` bytes; inverse-plus-apply
+  `98304` bytes.
+
+Result:
+
+| case | direct ms | inverse+apply ms | inverse/direct speed | direct NMSE | inverse NMSE | direct max abs | inverse max abs | max lower row sum |
+|------|----------:|-----------------:|---------------------:|------------:|-------------:|---------------:|----------------:|------------------:|
+| weak decay | `3.263936` | `5.493515` | `0.5941x` | `2.081e-14` | `2.755e-15` | `8.890e-07` | `2.415e-07` | `4.072` |
+| mixed decay | `3.275959` | `5.527584` | `0.5927x` | `1.981e-14` | `7.541e-16` | `8.115e-07` | `7.888e-08` | `1.635` |
+
+Decision:
+
+- Reject this explicit inverse-plus-apply shape as a backend source candidate on
+  GB10. It is numerically clean but materially slower than direct solve/apply.
+- Do not touch `ggml/src/ggml-cuda/gated_delta_net.cu` for the larger C=64 path
+  based on this attempt.
+- A future GDN source-work gate would need a substantially different
+  tensor-core blocked solve/register-state design, not this shared-memory
+  inverse scaffold.
+
 ### Phase73: Datacenter Blackwell Rerun Readiness
 
 - Date: 2026-07-01.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index bdcea4521..909c9d12f 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -1224,7 +1224,7 @@ B200 rerun checklist:
   whether vLLM is using native FP4/CUTLASS/FlashInfer rather than the GB10
   Marlin fallback.
 
-Standalone GDN source-work gate:
+Phase74 standalone GDN source-work gate result:
 
 ```sh
 nvcc -O3 -arch=sm_121a \
@@ -1236,10 +1236,23 @@ nvcc -O3 -arch=sm_121a \
   --iters 1000 \
   --precision tf32,offdiag3x,apply3x \
   --oracle f64 \
-  --dump-json ~/bench/phase73_gdn_blocked_solve_poc.json
+  --dump-json ~/bench/phase74_gdn_blocked_solve_poc/20260701_143711/phase74_gdn_blocked_solve_poc.json
 ```
 
-Do not touch `ggml/src/ggml-cuda/gated_delta_net.cu` for this larger path until
-that standalone artifact shows a material timing win, non-catastrophic weak and
-mixed decay error, plausible register/shared-memory fit, and records timing,
-precision-rung error, and condition-number distribution.
+Artifact:
+`/home/mudler/bench/phase74_gdn_blocked_solve_poc/20260701_143711`.
+
+The standalone C=64 shared-memory explicit inverse-plus-apply scaffold did not
+fund backend source work:
+
+- weak decay: direct solve/apply `3.263936 ms`; inverse-plus-apply
+  `5.493515 ms`; inverse/direct speed `0.5941x`; inverse NMSE `2.755e-15`;
+- mixed decay: direct solve/apply `3.275959 ms`; inverse-plus-apply
+  `5.527584 ms`; inverse/direct speed `0.5927x`; inverse NMSE `7.541e-16`;
+- shared memory was already near the GB10 cap: direct `81920` bytes,
+  inverse-plus-apply `98304` bytes, with `99 KB` opt-in available.
+
+Decision: do not touch `ggml/src/ggml-cuda/gated_delta_net.cu` for this C=64
+inverse scaffold on GB10. A future GDN source-work gate must be a substantially
+different tensor-core blocked-solve/register-state design that shows a material
+timing win before backend changes.
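
Reviewer note: the `inverse/direct speed` column and the "near the GB10 cap" remark in the patch above can be cross-checked from the reported figures alone. The sketch below is not part of the benchmark harness. The function names are hypothetical, and it assumes that `99 KB` means `99 * 1024` bytes and that the speed column is direct time divided by inverse-plus-apply time.

```python
# Cross-check of the Phase74 figures quoted in the patch.
# Assumptions (not confirmed by the source): "99 KB" is 99 * 1024 bytes, and
# "inverse/direct speed" is direct_ms / inverse_ms (below 1.0 means slower).

# Timings in milliseconds, copied from the Phase74 result table.
RESULTS = {
    "weak decay": {"direct_ms": 3.263936, "inverse_ms": 5.493515},
    "mixed decay": {"direct_ms": 3.275959, "inverse_ms": 5.527584},
}

# Dynamic shared-memory figures, copied from the Phase74 notes.
DIRECT_SMEM_BYTES = 81920
INVERSE_SMEM_BYTES = 98304
SMEM_CAP_BYTES = 99 * 1024  # assumed interpretation of "99 KB"


def relative_speed(direct_ms: float, inverse_ms: float) -> float:
    """Speed of inverse-plus-apply relative to direct solve/apply."""
    return direct_ms / inverse_ms


def smem_headroom(used_bytes: int, cap_bytes: int = SMEM_CAP_BYTES) -> int:
    """Bytes of dynamic shared memory left under the assumed cap."""
    return cap_bytes - used_bytes


if __name__ == "__main__":
    for case, t in RESULTS.items():
        ratio = relative_speed(t["direct_ms"], t["inverse_ms"])
        print(f"{case}: {ratio:.4f}x")
    print("direct headroom bytes:", smem_headroom(DIRECT_SMEM_BYTES))
    print("inverse headroom bytes:", smem_headroom(INVERSE_SMEM_BYTES))
```

Under those assumptions the ratios round to the table's `0.5941x` and `0.5927x`, and the inverse-plus-apply variant leaves 3072 bytes of headroom against 19456 bytes for the direct path.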