docs(paged): record BF16 GDN state cache phase

Record Phase81 default-off BF16 persistent S-cache results, including md5 drift, op gates, decode profile, and KL smoke. Scope Phase82 as full f16-reference KL plus serving A/B before patch-series promotion. Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 16:26:09 +00:00
parent d091eb30f2
commit 67d2c4c9d4
2 changed files with 91 additions and 7 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md
@@ -12,13 +12,15 @@ with artifact path, gates, benchmark rows, and decision.
 - Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
 - Current tested source: DGX mirror
  `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
- Latest attempt: Phase80.
- Latest decision: default-off source A/B `GDN_ASSUME_IDENTITY_IDS=1` is
-  rejected for carry-forward/default. It removed the tiny `gdn_gather` bucket
-  (`0.79 ms`) and kept gates green, but left `gdn_core` effectively unchanged
-  (`1411.65 ms` baseline vs `1409.28 ms` candidate). The assumption is too
-  narrow for the measured gain. The next real GDN lever remains recurrent-state
-  precision/traffic, with a separate invasive scope.
+- Latest attempt: Phase81.
+- Latest decision: default-off `LLAMA_QWEN35_GDN_S_CACHE_TYPE=bf16` is a
+  promising carry-forward candidate, not default-on. Same-source decode-only
+  profiling cut normalized `gdn_core` from about `2.34 ms/launch` to
+  `1.20 ms/launch` and total kernel time from `3.6157 s` to `3.5244 s`, but MoE
+  greedy md5 changed from the paged canonical `8cb0...` to `07db...`. Dense md5
+  stayed canonical and `MUL_MAT`/`MUL_MAT_ID` gates stayed green. Phase82 must
+  run the full f16-reference KL gate and serving A/B before this can be promoted
+  beyond an opt-in experiment.

 ## Current Serving Record

@@ -58,6 +60,76 @@ Decision:

 ## Attempt Log

+### Phase81: Qwen35 BF16 Persistent GDN S-Cache
+
+- Date: 2026-07-01.
+- Source: `/home/mudler/llama-phase81-bf16-state-source`, local fork patch in
+  `/home/mudler/_git/llama.cpp` branch `localai-paged`.
+- Build artifact: `/home/mudler/llama-phase81-bf16-state-source/build-cuda`.
+- Gate artifact:
+  `/home/mudler/bench/phase81_bf16_s_cache_gates/20260701_161350`.
+- Profile artifacts:
+  - default F32:
+    `/home/mudler/bench/phase81_bf16_s_cache_profile/default_20260701_162117`
+  - BF16 S-cache:
+    `/home/mudler/bench/phase81_bf16_s_cache_profile/bf16_20260701_162028`
+- KL smoke artifact:
+  `/home/mudler/bench/phase81_bf16_s_cache_kl/20260701_162322`.
+- Result type: source experiment. `LLAMA_QWEN35_GDN_S_CACHE_TYPE=bf16`
+  stores Qwen35/Qwen35MoE persistent recurrent S cache in BF16 while keeping GDN
+  recurrence math, q/k/v/g/beta, and output in F32. Default remains F32.
+
+Implementation scope:
+
+- Added BF16 state support for `ggml_gated_delta_net_inplace_ids` only.
+- Added CPU/CUDA BF16 state load/store conversion at the persistent cache
+  boundary.
+- Added BF16 CPU/CUDA `SCALE` support because recurrent cache zeroing uses
+  `ggml_scale_inplace(..., 0)` on the S cache.
+- Added tests for BF16 `GATED_DELTA_NET_INPLACE_IDS` and BF16 in-place `SCALE`.
+
+Local verification:
+
+| check | result |
+|-------|--------|
+| RED test before implementation | `ggml_gated_delta_net_inplace_ids` rejected BF16 state at `state->type == GGML_TYPE_F32` |
+| CPU `SCALE -p bf16` | `1/1` passed |
+| CPU `GATED_DELTA_NET_INPLACE_IDS` | `2/2` passed |
+| DGX CUDA build | completed for `llama-completion`, `llama-batched-bench`, `test-backend-ops`, `llama-server`, later `llama-perplexity` |
+
+Gates:
+
+| mode | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
+|------|---------|-----------|-----------|--------------|
+| default F32 | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| BF16 S-cache | `07db32c2bcb78d17a43ed18bc22705cd` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+
+Profile:
+
+| arm | env | active slots | depth start | depth mid | total kernel s | GDN ms | GDN share | `gdn_core` ms | `gdn_core` launches | `gdn_core`/launch | `mmq_nvfp4` ms |
+|-----|-----|-------------:|------------:|----------:|---------------:|-------:|----------:|--------------:|--------------------:|------------------:|---------------:|
+| default F32 | none | `128` | `65` | `87` | `3.6157` | `1480.44` | `40.94%` | `1399.30` | `599` | `2.336 ms` | `1394.28` |
+| BF16 S-cache | `LLAMA_QWEN35_GDN_S_CACHE_TYPE=bf16` | `128` | `65` | `91` | `3.5244` | `961.61` | `27.28%` | `863.57` | `720` | `1.199 ms` | `1665.38` |
+
+KL smoke against same-source F32 base:
+
+| check | result |
+|-------|--------|
+| shape | MoE, `-c 256 -b 256 --chunks 32`, Wikitext-2 raw |
+| F32 floor KLD vs F32 base | `0.000000 +/- 0.000000`, same-top-p `99.975%` |
+| BF16 S-cache KLD vs F32 base | `0.055499 +/- 0.001705`, same-top-p `88.361%` |
+| BF16 PPL ratio vs F32 base | `1.010356 +/- 0.005817` |
+
+Decision:
+
+- Carry forward as a default-off candidate and run Phase82 full gates.
+- Do not make it default-on: MoE greedy md5 is not canonical, and the KL smoke is
+  not the full f16-reference acceptance gate.
+- Required Phase82 before patch-series promotion:
+  full f16-reference KL gate for MoE and dense, same-source serving A/B against
+  F32 default and vLLM, then regenerate LocalAI patches from the fork only if
+  serving and KL both hold.
+
 ### Phase80: GDN Identity-Ids Shortcut Source A/B

 - Date: 2026-07-01.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -45,6 +45,18 @@ Read order for a cold start:
 > llama.cpp QoS knob, not a parity-closing lever. The trace and scheduler commits
 > are local and DGX-gated but not pushed, so the LocalAI patch series has not
 > been regenerated.
+>
+> 2026-07-01 Phase81 update: the next viable GDN lever is no longer launch
+> shape or gather removal. A default-off Qwen35/Qwen35MoE BF16 persistent
+> recurrent S-cache experiment (`LLAMA_QWEN35_GDN_S_CACHE_TYPE=bf16`) cut
+> same-source decode-only `gdn_core` from `1399.30 ms / 599 launches`
+> (`2.34 ms/launch`) to `863.57 ms / 720 launches` (`1.20 ms/launch`). Default
+> F32 md5 gates and op gates stayed green, and BF16 dense md5 stayed canonical,
+> but BF16 MoE md5 changed to `07db32c2bcb78d17a43ed18bc22705cd`. A quick
+> MoE KL smoke vs the same-source F32 base showed KLD `0.055499 +/- 0.001705`,
+> same-top-p `88.361%`, and PPL ratio `1.010356`. Treat this as a promising
+> default-off candidate only. Phase82 must run the full f16-reference KL gate
+> plus serving A/B before regenerating LocalAI patches or considering promotion.

 - Historical verdict: the older investigation marked GB10 parity **CLOSED** and
  unreachable. Treat that as superseded where Phase50-54 provide newer dense