paged: drop bf16-tau (patch 0026), subsumed by decode fusions (tau=100000 flat, zero speed benefit)

The opt-in hybrid per-head bf16 SSM-state lever (ssm_bf16_tau, patch 0026)
is removed from the llama-cpp-localai-paged patch series.

Clean re-measurement after the decode fusions (0028 recurrent-state
gather-fusion + 0029 block-table cache) landed shows it buys nothing:
forcing ALL gated-DeltaNet heads to bf16 (tau=100000, the most aggressive
setting) gives flat decode throughput, 780.6 vs 780.0 t/s. The mode engages
but adds zero speed because it is subsumed by the fusions. The earlier
"+12%" was measured before the fusions completed.

bf16-tau was a precision trade (not bit-exact, ~91% same-top-p) plus extra
bug surface and extra CUDA template-instantiation compile cost with no
offsetting benefit.

Dependency check: no later patch (0028/0029/0030) depends on 0026. 0030's
only mention is a description comment; its code keys off
fused_gdn_ar/ch/auto_fgdn, which originate in 0018/0019/0021 (before 0026).

The remaining series (0001-0025, 0028-0030) applies clean with
git apply --check against the pin 0ed235ea2c17a19fc8238668653946721ed136fd.
The Makefile applies the series by glob (patches/paged/0*.patch); the
resulting gap at 0026 is tolerated (0005/0027 are already absent).

Removed:
- patches/paged/0026-qwen35-hybrid-perhead-ssm-state.patch
- the dead ssm_bf16_tau / ssm_hybrid_tau option handler in the shared
  grpc-server.cpp (it only set LLAMA_SSM_BF16_TAU, now a no-op the library
  no longer reads)
- the patched+bf16-tau benchmark columns and llama-patched-bf16tau rows
  (README + final_benchmark.csv), the ssm_bf16_tau option text in backend
  index.yaml, the gallery NOTE block, and the docs/features/backends.md
  mention.

The rejected-lever lesson is kept (why it was dropped: subsumed, tau=100000
flat) in the backend README section 5, the paged-backend agent guide, and
the vLLM-parity methodology, so it is not re-tried.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-29 19:06:43 -04:00 · 2026-06-28 16:06:06 +00:00
parent 2c59805267
commit 4cd90bfae9
9 changed files with 75 additions and 2187 deletions
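
For readers who want to sanity-check the figures, here is a minimal sketch that recomputes the relative change for the tau=100000 re-measurement quoted in the commit message, plus the derived "x over stock" and "% of vLLM" values in the README tables touched by this diff. The helper names (`rel_change_pct`, `derived`) are illustrative assumptions and are not part of the repository; the throughput numbers are copied from the commit message and the tables in the diff.

```python
# Illustrative helpers (hypothetical names, not part of the LocalAI repo).
# All throughput figures are copied from this commit's message and README tables.

def rel_change_pct(new: float, base: float) -> float:
    """Relative change of `new` over `base`, in percent."""
    return (new - base) / base * 100.0

# tau=100000 (all gated-DeltaNet heads forced to bf16) vs the f32 baseline, decode t/s.
flat_delta = rel_change_pct(780.6, 780.0)

# Decode t/s per serving width: npl -> (stock, patched, vLLM prior-session reference).
DENSE = {
    8: (68.3, 85.3, 70.4),
    32: (119.9, 211.9, 211.8),
    64: (142.8, 305.2, 309.1),
    128: (155.1, 382.1, 418.8),
}
MOE = {
    8: (186.7, 230.3, 256.5),
    32: (267.4, 466.4, 500.8),
    64: (320.5, 622.4, 686.1),
    128: (347.2, 784.3, 882.2),
}

def derived(table):
    """Map npl -> (patched x over stock, patched as a whole-number % of vLLM)."""
    rows = {}
    for npl, (stock, patched, vllm) in table.items():
        rows[npl] = (round(patched / stock, 2), round(patched / vllm * 100))
    return rows

if __name__ == "__main__":
    print(f"bf16-tau delta at tau=100000: {flat_delta:+.2f}%")
    for name, table in (("dense", DENSE), ("moe", MOE)):
        for npl, (speedup, pct_vllm) in derived(table).items():
            print(f"{name} npl={npl}: {speedup:.2f}x over stock, {pct_vllm}% of vLLM")
```

The delta comes out under a tenth of a percent, which is what "flat" means here, and the derived values match the "patched x over stock" column and the "% of vLLM" sentences in the README hunks. The vLLM percentages remain indicative only, since that column comes from a different harness and a prior session.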
--- a/.agents/llama-cpp-localai-paged-backend.md
+++ b/.agents/llama-cpp-localai-paged-backend.md
@@ -42,8 +42,11 @@ how-to.
  dev doc or `*.md`. Strict `git apply` on a clean checkout must reach exit 0. (A
  stray `SSM_DECODE_FIX_RESULTS.md` hunk in patch 0019 once broke the CI build.)
 - **Bit-exact by default.** Every shipped patch is byte-identical to the f32
-  baseline. The one opt-in precision trade (`ssm_bf16_tau`, patch 0026) defaults
-  off; never put it in a recommended/gallery config.
+  baseline. (The one opt-in precision trade, `ssm_bf16_tau` / patch 0026, was
+  DROPPED: it went flat once the decode fusions landed - forcing all gated-DeltaNet
+  heads to bf16 gave 780.6 vs 780.0 t/s, zero benefit - so the series is now
+  bit-exact end to end. Do not reintroduce a per-head SSM-precision lever; see the
+  rejected-levers note in the backend README section 5.)

 ## Maintaining the pin against new llama.cpp

--- a/.agents/vllm-parity-methodology.md
+++ b/.agents/vllm-parity-methodology.md
@@ -54,9 +54,15 @@ backend README.
  the runtime flag off (compiled-in wins survive the toggle). Cross-engine "% of vLLM"
  (batched-bench vs vLLM server+client) is *indicative*; always caveat the harness
  and config (context length alone shifted the MoE figure 76% <-> 86%).
-- **The win may be a precision trade, not a free lever.** bf16 SSM state was +12%
-  but failed the f32 KL gate (vLLM keeps f32 too), so it ships default-off opt-in -
-  never in a recommended config.
+- **Re-measure a "win" after later levers land - it may evaporate.** bf16 SSM
+  state (the `ssm_bf16_tau` lever) benched +12% early and failed the f32 KL gate
+  (vLLM keeps f32 too), so it was kept default-off opt-in. Once the decode fusions
+  (recurrent-state gather-fusion + block-table cache) landed, a clean re-measure
+  forcing ALL gated-DeltaNet heads to bf16 (`tau=100000`) went **flat** - 780.6 vs
+  780.0 t/s. The "+12%" was subsumed by the fusions: the lever bought nothing, so
+  it was **dropped** (precision trade + bug surface + extra CUDA template-instantiation
+  compile cost, zero benefit). A win measured before the rest of the series is not a
+  win after it.
 - **Reject the obvious-but-wrong, with evidence.** A faster kernel that is off the
  critical path benches FLAT (the freed time becomes idle). Quantizing the bf16
  projections to NVFP4 cost ~6% PPL - and vLLM keeps them bf16 for the same reason.
--- a/backend/cpp/llama-cpp-localai-paged/README.md
+++ b/backend/cpp/llama-cpp-localai-paged/README.md
@@ -142,14 +142,22 @@ These are the dominant decode levers on the Qwen3.6 hybrid models. All bit-exact
 | 0023 | **NVFP4 activation-quantize de-dup** - the broadcast up/gate projections re-quantize the same token activation once per expert; quantize the unique token activations once and byte-copy them into the expert-gathered layout. The only NVFP4-specific patch. | yes (byte-identical) |
 | 0025 | **MoE decode re-graph** - keep CUDA graphs on for the grouped-MMQ MoE decode step (the upstream guard disables graphs conservatively; the grouped path has no host sync). Env-gated `LLAMA_MOE_FORCE_GRAPHS`. | yes (graph replay re-issues identical kernels) |

-### Pool reclaim, block-table cache, backend gate, opt-in bf16-SSM
+### Pool reclaim, block-table cache, backend gate

 | # | What it does | Bit-exact |
 |---|---|---|
 | 0024 | **Paged-pool burst-reclaim** - truncate trailing blocks on partial-tail `seq_rm`, defrag the free queue when idle, release blocks on slot completion. Fixes the long-server burst-degradation bug (post-burst prefill collapse 488->44 t/s, restored to 532). Host-side accounting only. | yes |
 | 0029 | **Block-table within-step host cache** - the block table is fixed for the whole step; cache it on first build and memcpy it for the other full-attention layers (get_block_table -87%/-91%). | yes, per path (paged-MoE ref `8cb0ce23`) |
 | 0030 | **Fused-op backend gate** - the fused GDN / discriminated SSM_CONV ops are CUDA-family + CPU only; force them off on any non-CUDA compute backend so a Vulkan/SYCL/Metal build can't silently run the wrong plain-conv kernel. | yes on CUDA (byte-identical pre-0030); safety gate elsewhere |
-| 0026 | **Hybrid per-head bf16 SSM state (opt-in)** - `--ssm-bf16-tau` / option `ssm_bf16_tau`: fast-decaying GDN heads (memory length below the tau threshold) persist state as bf16, halving that head's decode byte stream (~+12% decode). | default tau=0 = f32 = **bit-exact**; the bf16 mode is **NOT** bit-exact (~91% same-top-p) |
+
+> **Dropped: patch 0026 (hybrid per-head bf16 SSM state, `ssm_bf16_tau`).** Once
+> the decode fusions (0028 recurrent-state gather-fusion + 0029 block-table cache)
+> landed, the bf16-SSM lever bought nothing: a clean re-measurement forcing **all**
+> gated-DeltaNet heads to bf16 (`tau=100000`) gives **flat** decode (780.6 vs
+> 780.0 t/s) - the mode engages but adds zero throughput because it is subsumed by
+> the fusions. It was a precision trade (not bit-exact) plus extra bug surface and
+> CUDA template-instantiation compile cost with no benefit, so it was removed. See
+> section 5 ("rejected / flat levers") for the full record.

 ---

@@ -164,22 +172,27 @@ swept over serving width `npl` in {8, 32, 64, 128}. Plots:
 [`qwen36_moe_decode_vs_npl.png`](docs/qwen36_moe_decode_vs_npl.png); raw data
 [`final_benchmark.csv`](docs/final_benchmark.csv).

-![NVFP4 decode throughput vs concurrency on GB10: llama.cpp standard vs vLLM vs LocalAI's llama.cpp patches, plus the opt-in bf16-tau ceiling](docs/qwen36_decode_overview.png)
+![NVFP4 decode throughput vs concurrency on GB10: llama.cpp standard vs vLLM vs LocalAI's llama.cpp patches](docs/qwen36_decode_overview.png)

-> **What was re-measured (2026-06-27).** The three llama columns - **stock**,
-> **patched**, and **patched+bf16-tau** - were all re-measured this session on one
-> consistent `llama-batched-bench` harness. The **vLLM** column is the
-> **prior-session reference** (kept as-is, *not* re-run this session). Per-run peak
+> The plot above also shows a third "bf16-tau" llama curve. That was the opt-in
+> `ssm_bf16_tau` lever (patch 0026), since **dropped** - a clean re-measurement
+> showed it flat once the decode fusions landed (see section 5). The numbers below
+> use only **stock** vs **patched** vs **vLLM**.
+
+> **What was re-measured (2026-06-27).** The two llama columns - **stock** and
+> **patched** - were re-measured this session on one consistent
+> `llama-batched-bench` harness. The **vLLM** column is the **prior-session
+> reference** (kept as-is, *not* re-run this session). Per-run peak
 > VRAM was *not* re-captured: the GB10's unified Grace-Blackwell LPDDR5x reports
 > `[N/A]` to `nvidia-smi --query-gpu=memory.used` and the bench does not print it
 > (the memory-advantage note below is the prior-session finding).

-### (a) + (b) Patched vs stock vs patched+bf16-tau vs vLLM
+### (a) + (b) Patched vs stock vs vLLM

 The **stock** column is a separate, unpatched llama.cpp built at this backend's
-**exact pin (`9d5d882d`)**; the **patched** and **patched+bf16-tau** columns are
+**exact pin (`9d5d882d`)**; the **patched** column is
 the paged binary, env/flag-toggled (`LLAMA_KV_PAGED=1`, plus
-`LLAMA_MOE_FORCE_GRAPHS=1` for MoE; bf16-tau adds `--ssm-bf16-tau 64`). All three
+`LLAMA_MOE_FORCE_GRAPHS=1` for MoE). Both
 run on the **same harness**, so "x over stock" is an apples-to-apples measure of
 the patch series. (Note: the patch series' dominant SSM decode fusions are
 compiled in, not env-gated - toggling `LLAMA_KV_PAGED` alone on the *patched*
@@ -190,36 +203,26 @@ cross-engine "% of vLLM" is **indicative, not apples-to-apples**.

 **Dense Qwen3.6-27B-NVFP4** (decode t/s):

-| npl | stock | patched | patched+bf16-tau | vLLM (prior) | patched x over stock | bf16-tau over patched |
-|----:|------:|--------:|-----------------:|-------------:|---------------------:|----------------------:|
-| 8   |  68.3 |   85.3 |             87.8 |         70.4 | 1.25x | +3%  |
-| 32  | 119.9 |  211.9 |            231.0 |        211.8 | 1.77x | +9%  |
-| 64  | 142.8 |  305.2 |            341.4 |        309.1 | 2.14x | +12% |
-| 128 | 155.1 |  382.1 |            446.1 |        418.8 | 2.46x | +17% |
+| npl | stock | patched | vLLM (prior) | patched x over stock |
+|----:|------:|--------:|-------------:|---------------------:|
+| 8   |  68.3 |   85.3 |         70.4 | 1.25x |
+| 32  | 119.9 |  211.9 |        211.8 | 1.77x |
+| 64  | 142.8 |  305.2 |        309.1 | 2.14x |
+| 128 | 155.1 |  382.1 |        418.8 | 2.46x |

 Dense **patched** is parity-to-ahead of vLLM (121 / 100 / 99 / 91% of vLLM across
-the widths); **patched+bf16-tau** is **ahead of vLLM at every width** (125 / 109 /
-110 / 107%).
+the widths).

 **MoE Qwen3.6-35B-A3B-NVFP4** (decode t/s):

-| npl | stock | patched | patched+bf16-tau | vLLM (prior) | patched x over stock | bf16-tau over patched |
-|----:|------:|--------:|-----------------:|-------------:|---------------------:|----------------------:|
-| 8   | 186.7 |  230.3 |            240.5 |        256.5 | 1.23x | +4%  |
-| 32  | 267.4 |  466.4 |            508.1 |        500.8 | 1.74x | +9%  |
-| 64  | 320.5 |  622.4 |            703.8 |        686.1 | 1.94x | +13% |
-| 128 | 347.2 |  784.3 |            918.0 |        882.2 | 2.26x | +17% |
+| npl | stock | patched | vLLM (prior) | patched x over stock |
+|----:|------:|--------:|-------------:|---------------------:|
+| 8   | 186.7 |  230.3 |        256.5 | 1.23x |
+| 32  | 267.4 |  466.4 |        500.8 | 1.74x |
+| 64  | 320.5 |  622.4 |        686.1 | 1.94x |
+| 128 | 347.2 |  784.3 |        882.2 | 2.26x |

-MoE **patched** is 90 / 93 / 91 / 89% of vLLM; **patched+bf16-tau** reaches
-parity-to-ahead (94 / 101 / 103 / 104%) at npl>=32.
-
-**On bf16-tau.** The `patched+bf16-tau` column uses `--ssm-bf16-tau 64` (no exact
-tau was recorded behind the documented "~+12%"; the flag help suggests 32/64, and
-64 bf16's more of the fast-decaying GDN heads). It is **opt-in and NOT bit-exact**
-(~91% same-top-p; see section 5) - it persists fast-decaying GDN head state as
-bf16 to halve that head's recurrence byte stream. Measured decode gain over
-patched grows with serving width: **+3-4% at npl8, ~+12-13% at npl64, +17% at
-npl128** (dense and MoE alike).
+MoE **patched** is 90 / 93 / 91 / 89% of vLLM.

 **Caveat on the vLLM column.** It is a **different harness** and a
 **prior-session** measurement (not re-run this session), so the cross-engine "% of
@@ -229,10 +232,8 @@ vLLM" is **indicative, not apples-to-apples**. Memory (prior session): llama use
 **Takeaway.** Re-measured this session, the patch series gives up to **2.46x
 (dense) / 2.26x (MoE)** over true-stock `9d5d882d` on the same harness (close to,
 slightly below, the prior 2.59x / 2.33x - llama was re-measured, vLLM kept).
-Opt-in **bf16-tau adds a further +3% to +17%** on top of patched (growing with
-width). Dense is **ahead of vLLM** with bf16-tau at every width; MoE **patched**
-sits at ~89-93% of the prior-session vLLM and **bf16-tau reaches parity-to-ahead**
-at npl>=32. The residual non-bf16 MoE gap is structural (see section 5).
+Dense is parity-to-ahead of vLLM; MoE **patched** sits at ~89-93% of the
+prior-session vLLM. The residual MoE gap is structural (see section 5).

 ### (c) Apple Silicon (M4, 16GB Metal) - does the patchset help here?

@@ -314,14 +315,20 @@ llama is losing. The MoE GEMM kernel is *not* where the gap lives.
  (The "the win was NVFP4-dense-quant, not the Marlin kernel" dense verdict
  carries over to MoE.)

-**Opt-in bf16-SSM fast mode** (patch 0026, `ssm_bf16_tau`). The design premise -
-that bf16 KL error concentrates in long-memory heads and can be removed by
-keeping them f32 - is **empirically refuted**: the error scales with the bf16
+**Opt-in bf16-SSM fast mode - DROPPED (was patch 0026, `ssm_bf16_tau`).** The
+design premise - that bf16 KL error concentrates in long-memory heads and can be
+removed by keeping them f32 - was already shaky: the error scales with the bf16
 head *count* and saturates (~0.06 MeanKLD / ~91% same-top-p) far below any useful
-byte saving, and the carry is byte-exact (genuine bf16 rounding, not a bug). The
-byte-saving (and ~+12% decode) is real but cannot meet a strict KL bar, so it
-ships **default-off (f32, bit-exact)** and opt-in only. Do not put a hybrid tau
-in a recommended/gallery config.
+byte saving. The lever was then **removed entirely** once the decode fusions
+(0028 recurrent-state gather-fusion + 0029 block-table cache) landed: a clean
+re-measurement that forced **all** gated-DeltaNet heads to bf16 (`tau=100000`,
+the most aggressive setting) gave **flat** decode throughput - **780.6 vs 780.0
+t/s**. The mode engages but buys **zero** speed; the earlier "+12%" was subsumed
+by the fusions. So bf16-tau was a precision trade (not bit-exact) plus extra bug
+surface and CUDA template-instantiation compile cost with **no** offsetting
+benefit, and patch 0026 was dropped from the series. Lesson recorded so it is not
+re-tried: do not reintroduce a per-head SSM-precision lever - the bandwidth it
+targeted is already recovered by the gather-fusion + block-table cache.

 ---

@@ -403,6 +410,6 @@ The benchmarked NVFP4 GGUFs are published and wired into the LocalAI gallery:

 Both gallery entries set `backend: llama-cpp-localai-paged` and the paged serving config
 (`paged_kv:true`, `max_batch_tokens`, `kv_unified:false`, `parallel`,
-`flash_attention:on`, `context_size`). They intentionally stay bit-exact (no
-`ssm_bf16_tau`). The full backend-split + gallery plan is in
+`flash_attention:on`, `context_size`). They are bit-exact. The full
+backend-split + gallery plan is in
 [`LOCALAI_LLAMACPP_BACKEND_PLAN.md`](docs/LOCALAI_LLAMACPP_BACKEND_PLAN.md).
--- a/backend/cpp/llama-cpp-localai-paged/docs/final_benchmark.csv
+++ b/backend/cpp/llama-cpp-localai-paged/docs/final_benchmark.csv
@@ -7,10 +7,6 @@ q36-27b-nvfp4,llama-patched,8,85.3,915.1
 q36-27b-nvfp4,llama-patched,32,211.9,919.0
 q36-27b-nvfp4,llama-patched,64,305.2,923.5
 q36-27b-nvfp4,llama-patched,128,382.1,922.9
-q36-27b-nvfp4,llama-patched-bf16tau,8,87.8,919.2
-q36-27b-nvfp4,llama-patched-bf16tau,32,231.0,931.1
-q36-27b-nvfp4,llama-patched-bf16tau,64,341.4,930.7
-q36-27b-nvfp4,llama-patched-bf16tau,128,446.1,932.2
 q36-27b-nvfp4,vllm,8,70.4,2096.2
 q36-27b-nvfp4,vllm,32,211.8,2182.6
 q36-27b-nvfp4,vllm,64,309.1,2088.9
@@ -23,10 +19,6 @@ q36-35b-a3b-nvfp4,llama-patched,8,230.3,1510.3
 q36-35b-a3b-nvfp4,llama-patched,32,466.4,1969.2
 q36-35b-a3b-nvfp4,llama-patched,64,622.4,2122.8
 q36-35b-a3b-nvfp4,llama-patched,128,784.3,2177.0
-q36-35b-a3b-nvfp4,llama-patched-bf16tau,8,240.5,1539.8
-q36-35b-a3b-nvfp4,llama-patched-bf16tau,32,508.1,2031.7
-q36-35b-a3b-nvfp4,llama-patched-bf16tau,64,703.8,2151.8
-q36-35b-a3b-nvfp4,llama-patched-bf16tau,128,918.0,2212.3
 q36-35b-a3b-nvfp4,vllm,8,256.5,5186.5
 q36-35b-a3b-nvfp4,vllm,32,500.8,6223.4
 q36-35b-a3b-nvfp4,vllm,64,686.1,5926.5
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0026-qwen35-hybrid-perhead-ssm-state.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0026-qwen35-hybrid-perhead-ssm-state.patch
--- a/backend/cpp/llama-cpp/grpc-server.cpp
+++ b/backend/cpp/llama-cpp/grpc-server.cpp
@@ -854,27 +854,6 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
                    // If conversion fails, leave the per-slot cap unset (engine default)
                }
            }
-        // --- hybrid per-head bf16 SSM-state precision (patch 0026, qwen3.5 gated-DeltaNet decode) ---
-        // Opt-in reduced-precision fast mode for the recurrent SSM state: a gated-DeltaNet head whose
-        // memory length tau_h = 1/(|ssm_a|*softplus(ssm_dt)) tokens exceeds this threshold stays f32;
-        // faster-decaying heads persist their state as bf16, halving that head's dominant recurrence
-        // byte stream on decode. The value is the tau threshold in tokens (e.g. 32 / 64); 0 keeps every
-        // head f32 (the bit-exact default). Set BEFORE context init via LLAMA_SSM_BF16_TAU, consumed in
-        // common_context_params_to_llama (patch 0026) only when the --ssm-bf16-tau CLI flag is unset.
-        // Unset / non-positive => env untouched, so stock stays byte-identical and bit-exact (an
-        // externally exported LLAMA_SSM_BF16_TAU still works as an escape hatch). NOTE: this mode is
-        // NOT bit-exact (~91% same-top-p ceiling); see backend/cpp/llama-cpp-localai-paged/README.md (Dev notes).
-        } else if (!strcmp(optname, "ssm_bf16_tau") || !strcmp(optname, "ssm_hybrid_tau")) {
-            if (optval != NULL) {
-                try {
-                    float tau = std::stof(optval_str);
-                    if (tau > 0.0f) {
-                        setenv("LLAMA_SSM_BF16_TAU", std::to_string(tau).c_str(), 1);
-                    }
-                } catch (const std::exception& e) {
-                    // If conversion fails, leave the threshold unset (bit-exact f32 default)
-                }
-            }
        } else if (!strcmp(optname, "n_ctx_checkpoints") || !strcmp(optname, "ctx_checkpoints")) {
            if (optval != NULL) {
                try {
--- a/backend/index.yaml
+++ b/backend/index.yaml
@@ -83,10 +83,7 @@
    stock llama-cpp backend, with the LocalAI paged patch series applied
    (vendored in this backend). Tuned for NVFP4 dense / MoE on Blackwell / GB10. Reuses the
    llama-cpp gRPC server sources; the paged engine is gated at runtime by the
-    paged_kv / max_batch_tokens model options. Qwen3.5 gated-DeltaNet models can
-    additionally opt into the reduced-precision hybrid SSM-state fast mode with
-    the ssm_bf16_tau:<tokens> option (default off = bit-exact f32; non-bit-exact
-    when enabled).
+    paged_kv / max_batch_tokens model options.
  urls:
    - https://github.com/ggerganov/llama.cpp
  tags:
--- a/docs/content/features/backends.md
+++ b/docs/content/features/backends.md
@@ -125,7 +125,7 @@ For getting started, see the available backends in LocalAI here: https://github.
 LocalAI supports various types of backends:

 - **LLM Backends**: For running language models (e.g., llama.cpp, vLLM, SGLang, transformers, MLX)
-  - **`llama-cpp-localai-paged`**: LocalAI's paged-attention llama.cpp variant - on-demand paged KV cache plus a decode-first prefill budget, tuned for NVFP4 dense/MoE on Blackwell/GB10. Same upstream llama.cpp pin as the stock `llama-cpp` backend, reusing its gRPC server; the paged engine is enabled per-model via the `paged_kv` / `max_batch_tokens` options. For Qwen3.5 gated-DeltaNet (hybrid SSM) models you can additionally set `options: [ssm_bf16_tau:<tokens>]` to enable the reduced-precision hybrid SSM-state fast mode: fast-decaying recurrent heads (memory length tau below the threshold, e.g. `32` / `64`) persist their state as bf16, halving that head's decode byte stream. Default off (`0`) keeps every head f32 and is bit-exact; when enabled the mode is **not** bit-exact (~91% same-top-p ceiling - see `backend/cpp/llama-cpp-localai-paged/README.md` for the quality/throughput profile).
+  - **`llama-cpp-localai-paged`**: LocalAI's paged-attention llama.cpp variant - on-demand paged KV cache plus a decode-first prefill budget, tuned for NVFP4 dense/MoE on Blackwell/GB10. Same upstream llama.cpp pin as the stock `llama-cpp` backend, reusing its gRPC server; the paged engine is enabled per-model via the `paged_kv` / `max_batch_tokens` options.
 - **Speech-to-Text Backends**: For transcription (e.g., whisper.cpp, parakeet.cpp, faster-whisper, NeMo)
 - **Text-to-Speech Backends**: For speech synthesis (e.g., piper, Kokoro, VibeVoice, Qwen3-TTS)
 - **Sound Generation Backends**: For music and audio generation (e.g., ACE-Step)
--- a/gallery/index.yaml
+++ b/gallery/index.yaml
@@ -14,14 +14,11 @@
 # GGUFs were re-quantized with a newer convert (origin/master) preserving the same
 # MOSTLY_NVFP4 format; a load check on the paged backend GPU build is the final gate.
 #
-# NOTE(ssm_bf16_tau): Qwen3.5 gated-DeltaNet (hybrid SSM) models can opt into the
-# reduced-precision hybrid SSM-state fast mode by adding `ssm_bf16_tau:<tokens>`
-# (e.g. 32 / 64) to a model's `options:` list - fast-decaying recurrent heads then
-# persist their state as bf16 (LLAMA_SSM_BF16_TAU), halving that head's decode byte
-# stream. Default off (0) = every head f32 = bit-exact; when enabled the mode is NOT
-# bit-exact (~91% same-top-p, beats vLLM dense) - see
-# backend/cpp/llama-cpp-localai-paged/README.md for the quality profile.
-# The two NVFP4 entries below intentionally stay bit-exact (no ssm_bf16_tau).
+# The two NVFP4 entries below are bit-exact (f32 SSM state). The opt-in
+# reduced-precision hybrid SSM-state lever (ssm_bf16_tau, patch 0026) was DROPPED:
+# clean measurements showed it flat once the decode fusions landed (forcing all
+# gated-DeltaNet heads to bf16 gave 780.6 vs 780.0 t/s, zero benefit) - see
+# backend/cpp/llama-cpp-localai-paged/README.md section 5.
 # =============================================================================
 - name: "qwen3.6-27b-nvfp4-paged"
  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"