paged: drop bf16-tau (patch 0026), subsumed by decode fusions (tau=100000 flat, zero speed benefit)

The opt-in hybrid per-head bf16 SSM-state lever (ssm_bf16_tau, patch 0026) is
removed from the llama-cpp-localai-paged patch series. Clean re-measurement after
the decode fusions (0028 recurrent-state gather-fusion + 0029 block-table cache)
landed shows it buys nothing: forcing ALL gated-DeltaNet heads to bf16
(tau=100000, the most aggressive setting) gives flat decode throughput, 780.6 vs
780.0 t/s. The mode engages but adds zero speed because it is subsumed by the
fusions. The earlier "+12%" was measured before the fusions completed. bf16-tau
was a precision trade (not bit-exact, ~91% same-top-p) plus extra bug surface and
extra CUDA template-instantiation compile cost with no offsetting benefit.
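
As a rough illustration of what "flat" means for the two figures quoted above, the relative change can be computed directly. This is a minimal sketch: the helper name `relative_gain` and the 1% noise floor are assumptions made for the example, not part of the benchmark harness, and the commit message does not state which of the two figures is the bf16 run (the magnitude is the same either way).

```python
def relative_gain(candidate_tps: float, baseline_tps: float) -> float:
    """Return the fractional throughput change of candidate over baseline."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return candidate_tps / baseline_tps - 1.0


# The two decode figures from the commit message (tokens per second).
gain = relative_gain(780.6, 780.0)

# Assumed noise floor of 1%; the real run-to-run variance is not recorded here.
is_flat = abs(gain) < 0.01

print(f"{gain:+.2%} flat={is_flat}")
```

The difference comes out at under a tenth of a percent, far below the earlier "+12%" claim.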

Dependency check: no later patch (0028/0029/0030) depends on 0026. 0030 mentions
0026 only in a description comment; its code keys off fused_gdn_ar/ch/auto_fgdn,
which originate in 0018/0019/0021 (before 0026). The remaining series (0001-0025,
0028-0030) applies cleanly with git apply --check against the pin
0ed235ea2c17a19fc8238668653946721ed136fd. The Makefile applies the series by glob
(patches/paged/0*.patch), so the resulting gap at 0026 is tolerated (0005/0027 are
already absent).
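
Why a glob tolerates gaps can be sketched in a few lines: matching is by file-name pattern, so missing numbers are simply never visited. The helper below lists the gaps in a numbered series. The function name `series_gaps` and the `-example.patch` file names are hypothetical; only the numbering (0005, 0026 and 0027 absent, series ending at 0030) follows the commit message.

```python
import re

# Four leading digits, a dash, anything, then the .patch suffix.
PATCH_RE = re.compile(r"^(\d{4})-.*\.patch$")


def series_gaps(filenames):
    """Return the patch numbers missing between the lowest and highest present."""
    present = set()
    for name in filenames:
        match = PATCH_RE.match(name)
        if match:
            present.add(int(match.group(1)))
    if not present:
        return []
    return [n for n in range(min(present), max(present) + 1) if n not in present]


# Hypothetical file names; only the numbering mirrors the series described above.
names = [f"{n:04d}-example.patch" for n in range(1, 31) if n not in (5, 26, 27)]
print(series_gaps(names))
```

A check like this is only a reporting aid; the series itself is still validated by `git apply --check` as described above.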

Removed:
- patches/paged/0026-qwen35-hybrid-perhead-ssm-state.patch
- the dead ssm_bf16_tau / ssm_hybrid_tau option handler in the shared
  grpc-server.cpp (it only set LLAMA_SSM_BF16_TAU, now a no-op the library no
  longer reads)
- the patched+bf16-tau benchmark columns and llama-patched-bf16tau rows
  (README + final_benchmark.csv), the ssm_bf16_tau option text in backend
  index.yaml, the gallery NOTE block, and the docs/features/backends.md mention.

The rejected-lever lesson (dropped because it is subsumed by the fusions: flat at
tau=100000) is kept in the backend README section 5, the paged-backend agent
guide, and the vLLM-parity methodology, so the lever is not retried.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ettore Di Giacinto
2026-06-28 16:06:06 +00:00
parent 2c59805267
commit 4cd90bfae9
9 changed files with 75 additions and 2187 deletions


@@ -42,8 +42,11 @@ how-to.
dev doc or `*.md`. Strict `git apply` on a clean checkout must reach exit 0. (A
stray `SSM_DECODE_FIX_RESULTS.md` hunk in patch 0019 once broke the CI build.)
- **Bit-exact by default.** Every shipped patch is byte-identical to the f32
baseline. The one opt-in precision trade (`ssm_bf16_tau`, patch 0026) defaults
off; never put it in a recommended/gallery config.
baseline. (The one opt-in precision trade, `ssm_bf16_tau` / patch 0026, was
DROPPED: it went flat once the decode fusions landed - forcing all gated-DeltaNet
heads to bf16 gave 780.6 vs 780.0 t/s, zero benefit - so the series is now
bit-exact end to end. Do not reintroduce a per-head SSM-precision lever; see the
rejected-levers note in the backend README section 5.)
## Maintaining the pin against new llama.cpp


@@ -54,9 +54,15 @@ backend README.
the runtime flag off (compiled-in wins survive the toggle). Cross-engine "% of vLLM"
(batched-bench vs vLLM server+client) is *indicative*; always caveat the harness
and config (context length alone shifted the MoE figure 76% <-> 86%).
- **The win may be a precision trade, not a free lever.** bf16 SSM state was +12%
but failed the f32 KL gate (vLLM keeps f32 too), so it ships default-off opt-in -
never in a recommended config.
- **Re-measure a "win" after later levers land - it may evaporate.** bf16 SSM
state (the `ssm_bf16_tau` lever) benched +12% early and failed the f32 KL gate
(vLLM keeps f32 too), so it was kept default-off opt-in. Once the decode fusions
(recurrent-state gather-fusion + block-table cache) landed, a clean re-measure
forcing ALL gated-DeltaNet heads to bf16 (`tau=100000`) went **flat** - 780.6 vs
780.0 t/s. The "+12%" was subsumed by the fusions: the lever bought nothing, so
it was **dropped** (precision trade + bug surface + extra CUDA template-instantiation
compile cost, zero benefit). A win measured before the rest of the series is not a
win after it.
- **Reject the obvious-but-wrong, with evidence.** A faster kernel that is off the
critical path benches FLAT (the freed time becomes idle). Quantizing the bf16
projections to NVFP4 cost ~6% PPL - and vLLM keeps them bf16 for the same reason.


@@ -142,14 +142,22 @@ These are the dominant decode levers on the Qwen3.6 hybrid models. All bit-exact
| 0023 | **NVFP4 activation-quantize de-dup** - the broadcast up/gate projections re-quantize the same token activation once per expert; quantize the unique token activations once and byte-copy them into the expert-gathered layout. The only NVFP4-specific patch. | yes (byte-identical) |
| 0025 | **MoE decode re-graph** - keep CUDA graphs on for the grouped-MMQ MoE decode step (the upstream guard disables graphs conservatively; the grouped path has no host sync). Env-gated `LLAMA_MOE_FORCE_GRAPHS`. | yes (graph replay re-issues identical kernels) |
### Pool reclaim, block-table cache, backend gate, opt-in bf16-SSM
### Pool reclaim, block-table cache, backend gate
| # | What it does | Bit-exact |
|---|---|---|
| 0024 | **Paged-pool burst-reclaim** - truncate trailing blocks on partial-tail `seq_rm`, defrag the free queue when idle, release blocks on slot completion. Fixes the long-server burst-degradation bug (post-burst prefill collapse 488->44 t/s, restored to 532). Host-side accounting only. | yes |
| 0029 | **Block-table within-step host cache** - the block table is fixed for the whole step; cache it on first build and memcpy it for the other full-attention layers (get_block_table -87%/-91%). | yes, per path (paged-MoE ref `8cb0ce23`) |
| 0030 | **Fused-op backend gate** - the fused GDN / discriminated SSM_CONV ops are CUDA-family + CPU only; force them off on any non-CUDA compute backend so a Vulkan/SYCL/Metal build can't silently run the wrong plain-conv kernel. | yes on CUDA (byte-identical pre-0030); safety gate elsewhere |
| 0026 | **Hybrid per-head bf16 SSM state (opt-in)** - `--ssm-bf16-tau` / option `ssm_bf16_tau`: fast-decaying GDN heads (memory length below the tau threshold) persist state as bf16, halving that head's decode byte stream (~+12% decode). | default tau=0 = f32 = **bit-exact**; the bf16 mode is **NOT** bit-exact (~91% same-top-p) |
> **Dropped: patch 0026 (hybrid per-head bf16 SSM state, `ssm_bf16_tau`).** Once
> the decode fusions (0028 recurrent-state gather-fusion + 0029 block-table cache)
> landed, the bf16-SSM lever bought nothing: a clean re-measurement forcing **all**
> gated-DeltaNet heads to bf16 (`tau=100000`) gives **flat** decode (780.6 vs
> 780.0 t/s) - the mode engages but adds zero throughput because it is subsumed by
> the fusions. It was a precision trade (not bit-exact) plus extra bug surface and
> CUDA template-instantiation compile cost with no benefit, so it was removed. See
> section 5 ("rejected / flat levers") for the full record.
---
@@ -164,22 +172,27 @@ swept over serving width `npl` in {8, 32, 64, 128}. Plots:
[`qwen36_moe_decode_vs_npl.png`](docs/qwen36_moe_decode_vs_npl.png); raw data
[`final_benchmark.csv`](docs/final_benchmark.csv).
![NVFP4 decode throughput vs concurrency on GB10: llama.cpp standard vs vLLM vs LocalAI's llama.cpp patches, plus the opt-in bf16-tau ceiling](docs/qwen36_decode_overview.png)
![NVFP4 decode throughput vs concurrency on GB10: llama.cpp standard vs vLLM vs LocalAI's llama.cpp patches](docs/qwen36_decode_overview.png)
> **What was re-measured (2026-06-27).** The three llama columns - **stock**,
> **patched**, and **patched+bf16-tau** - were all re-measured this session on one
> consistent `llama-batched-bench` harness. The **vLLM** column is the
> **prior-session reference** (kept as-is, *not* re-run this session). Per-run peak
> The plot above also shows a third "bf16-tau" llama curve. That was the opt-in
> `ssm_bf16_tau` lever (patch 0026), since **dropped** - a clean re-measurement
> showed it flat once the decode fusions landed (see section 5). The numbers below
> use only **stock** vs **patched** vs **vLLM**.
> **What was re-measured (2026-06-27).** The two llama columns - **stock** and
> **patched** - were re-measured this session on one consistent
> `llama-batched-bench` harness. The **vLLM** column is the **prior-session
> reference** (kept as-is, *not* re-run this session). Per-run peak
> VRAM was *not* re-captured: the GB10's unified Grace-Blackwell LPDDR5x reports
> `[N/A]` to `nvidia-smi --query-gpu=memory.used` and the bench does not print it
> (the memory-advantage note below is the prior-session finding).
### (a) + (b) Patched vs stock vs patched+bf16-tau vs vLLM
### (a) + (b) Patched vs stock vs vLLM
The **stock** column is a separate, unpatched llama.cpp built at this backend's
**exact pin (`9d5d882d`)**; the **patched** and **patched+bf16-tau** columns are
**exact pin (`9d5d882d`)**; the **patched** column is
the paged binary, env/flag-toggled (`LLAMA_KV_PAGED=1`, plus
`LLAMA_MOE_FORCE_GRAPHS=1` for MoE; bf16-tau adds `--ssm-bf16-tau 64`). All three
`LLAMA_MOE_FORCE_GRAPHS=1` for MoE). Both
run on the **same harness**, so "x over stock" is an apples-to-apples measure of
the patch series. (Note: the patch series' dominant SSM decode fusions are
compiled in, not env-gated - toggling `LLAMA_KV_PAGED` alone on the *patched*
@@ -190,36 +203,26 @@ cross-engine "% of vLLM" is **indicative, not apples-to-apples**.
**Dense Qwen3.6-27B-NVFP4** (decode t/s):
| npl | stock | patched | patched+bf16-tau | vLLM (prior) | patched x over stock | bf16-tau over patched |
|----:|------:|--------:|-----------------:|-------------:|---------------------:|----------------------:|
| 8 | 68.3 | 85.3 | 87.8 | 70.4 | 1.25x | +3% |
| 32 | 119.9 | 211.9 | 231.0 | 211.8 | 1.77x | +9% |
| 64 | 142.8 | 305.2 | 341.4 | 309.1 | 2.14x | +12% |
| 128 | 155.1 | 382.1 | 446.1 | 418.8 | 2.46x | +17% |
| npl | stock | patched | vLLM (prior) | patched x over stock |
|----:|------:|--------:|-------------:|---------------------:|
| 8 | 68.3 | 85.3 | 70.4 | 1.25x |
| 32 | 119.9 | 211.9 | 211.8 | 1.77x |
| 64 | 142.8 | 305.2 | 309.1 | 2.14x |
| 128 | 155.1 | 382.1 | 418.8 | 2.46x |
Dense **patched** is ahead of or level with vLLM through npl 64 and behind at npl 128 (121 / 100 / 99 / 91% of vLLM across
the widths); **patched+bf16-tau** is **ahead of vLLM at every width** (125 / 109 /
110 / 107%).
the widths).
**MoE Qwen3.6-35B-A3B-NVFP4** (decode t/s):
| npl | stock | patched | patched+bf16-tau | vLLM (prior) | patched x over stock | bf16-tau over patched |
|----:|------:|--------:|-----------------:|-------------:|---------------------:|----------------------:|
| 8 | 186.7 | 230.3 | 240.5 | 256.5 | 1.23x | +4% |
| 32 | 267.4 | 466.4 | 508.1 | 500.8 | 1.74x | +9% |
| 64 | 320.5 | 622.4 | 703.8 | 686.1 | 1.94x | +13% |
| 128 | 347.2 | 784.3 | 918.0 | 882.2 | 2.26x | +17% |
| npl | stock | patched | vLLM (prior) | patched x over stock |
|----:|------:|--------:|-------------:|---------------------:|
| 8 | 186.7 | 230.3 | 256.5 | 1.23x |
| 32 | 267.4 | 466.4 | 500.8 | 1.74x |
| 64 | 320.5 | 622.4 | 686.1 | 1.94x |
| 128 | 347.2 | 784.3 | 882.2 | 2.26x |
MoE **patched** is 90 / 93 / 91 / 89% of vLLM; **patched+bf16-tau** reaches
parity-to-ahead (94 / 101 / 103 / 104%) at npl>=32.
**On bf16-tau.** The `patched+bf16-tau` column uses `--ssm-bf16-tau 64` (no exact
tau was recorded behind the documented "~+12%"; the flag help suggests 32/64, and
64 bf16's more of the fast-decaying GDN heads). It is **opt-in and NOT bit-exact**
(~91% same-top-p; see section 5) - it persists fast-decaying GDN head state as
bf16 to halve that head's recurrence byte stream. Measured decode gain over
patched grows with serving width: **+3-4% at npl8, ~+12-13% at npl64, +17% at
npl128** (dense and MoE alike).
MoE **patched** is 90 / 93 / 91 / 89% of vLLM.
**Caveat on the vLLM column.** It is a **different harness** and a
**prior-session** measurement (not re-run this session), so the cross-engine "% of
@@ -229,10 +232,8 @@ vLLM" is **indicative, not apples-to-apples**. Memory (prior session): llama use
**Takeaway.** Re-measured this session, the patch series gives up to **2.46x
(dense) / 2.26x (MoE)** over true-stock `9d5d882d` on the same harness (close to,
slightly below, the prior 2.59x / 2.33x - llama was re-measured, vLLM kept).
Opt-in **bf16-tau adds a further +3% to +17%** on top of patched (growing with
width). Dense is **ahead of vLLM** with bf16-tau at every width; MoE **patched**
sits at ~89-93% of the prior-session vLLM and **bf16-tau reaches parity-to-ahead**
at npl>=32. The residual non-bf16 MoE gap is structural (see section 5).
Dense **patched** is ahead of or level with vLLM through npl 64 and at 91% at
npl 128; MoE **patched** sits at ~89-93% of the prior-session vLLM. The residual
MoE gap is structural (see section 5).
### (c) Apple Silicon (M4, 16GB Metal) - does the patchset help here?
@@ -314,14 +315,20 @@ llama is losing. The MoE GEMM kernel is *not* where the gap lives.
(The "the win was NVFP4-dense-quant, not the Marlin kernel" dense verdict
carries over to MoE.)
**Opt-in bf16-SSM fast mode** (patch 0026, `ssm_bf16_tau`). The design premise -
that bf16 KL error concentrates in long-memory heads and can be removed by
keeping them f32 - is **empirically refuted**: the error scales with the bf16
**Opt-in bf16-SSM fast mode - DROPPED (was patch 0026, `ssm_bf16_tau`).** The
design premise - that bf16 KL error concentrates in long-memory heads and can be
removed by keeping them f32 - was already shaky: the error scales with the bf16
head *count* and saturates (~0.06 MeanKLD / ~91% same-top-p) far below any useful
byte saving, and the carry is byte-exact (genuine bf16 rounding, not a bug). The
byte-saving (and ~+12% decode) is real but cannot meet a strict KL bar, so it
ships **default-off (f32, bit-exact)** and opt-in only. Do not put a hybrid tau
in a recommended/gallery config.
byte saving. The lever was then **removed entirely** once the decode fusions
(0028 recurrent-state gather-fusion + 0029 block-table cache) landed: a clean
re-measurement that forced **all** gated-DeltaNet heads to bf16 (`tau=100000`,
the most aggressive setting) gave **flat** decode throughput - **780.6 vs 780.0
t/s**. The mode engages but buys **zero** speed; the earlier "+12%" was subsumed
by the fusions. So bf16-tau was a precision trade (not bit-exact) plus extra bug
surface and CUDA template-instantiation compile cost with **no** offsetting
benefit, and patch 0026 was dropped from the series. Lesson recorded so it is not
re-tried: do not reintroduce a per-head SSM-precision lever - the bandwidth it
targeted is already recovered by the gather-fusion + block-table cache.
---
@@ -403,6 +410,6 @@ The benchmarked NVFP4 GGUFs are published and wired into the LocalAI gallery:
Both gallery entries set `backend: llama-cpp-localai-paged` and the paged serving config
(`paged_kv:true`, `max_batch_tokens`, `kv_unified:false`, `parallel`,
`flash_attention:on`, `context_size`). They intentionally stay bit-exact (no
`ssm_bf16_tau`). The full backend-split + gallery plan is in
`flash_attention:on`, `context_size`). They are bit-exact. The full
backend-split + gallery plan is in
[`LOCALAI_LLAMACPP_BACKEND_PLAN.md`](docs/LOCALAI_LLAMACPP_BACKEND_PLAN.md).


@@ -7,10 +7,6 @@ q36-27b-nvfp4,llama-patched,8,85.3,915.1
q36-27b-nvfp4,llama-patched,32,211.9,919.0
q36-27b-nvfp4,llama-patched,64,305.2,923.5
q36-27b-nvfp4,llama-patched,128,382.1,922.9
q36-27b-nvfp4,llama-patched-bf16tau,8,87.8,919.2
q36-27b-nvfp4,llama-patched-bf16tau,32,231.0,931.1
q36-27b-nvfp4,llama-patched-bf16tau,64,341.4,930.7
q36-27b-nvfp4,llama-patched-bf16tau,128,446.1,932.2
q36-27b-nvfp4,vllm,8,70.4,2096.2
q36-27b-nvfp4,vllm,32,211.8,2182.6
q36-27b-nvfp4,vllm,64,309.1,2088.9
@@ -23,10 +19,6 @@ q36-35b-a3b-nvfp4,llama-patched,8,230.3,1510.3
q36-35b-a3b-nvfp4,llama-patched,32,466.4,1969.2
q36-35b-a3b-nvfp4,llama-patched,64,622.4,2122.8
q36-35b-a3b-nvfp4,llama-patched,128,784.3,2177.0
q36-35b-a3b-nvfp4,llama-patched-bf16tau,8,240.5,1539.8
q36-35b-a3b-nvfp4,llama-patched-bf16tau,32,508.1,2031.7
q36-35b-a3b-nvfp4,llama-patched-bf16tau,64,703.8,2151.8
q36-35b-a3b-nvfp4,llama-patched-bf16tau,128,918.0,2212.3
q36-35b-a3b-nvfp4,vllm,8,256.5,5186.5
q36-35b-a3b-nvfp4,vllm,32,500.8,6223.4
q36-35b-a3b-nvfp4,vllm,64,686.1,5926.5
Columns (final_benchmark.csv): model, engine, npl, decode_agg_tps, prefill_tps
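
The "% of vLLM" figures quoted in the README tables can be recomputed from these rows. The sketch below is illustrative only: the function name `pct_of_vllm` is hypothetical, the comma-separated header is assumed from the column names in the rendered view, and the embedded data is limited to the dense rows visible in this hunk (the npl 128 vLLM row is not shown here, so it is left out).

```python
import csv
import io

# Dense rows copied from the hunk above; the header line is an assumption.
CSV_TEXT = """\
model,engine,npl,decode_agg_tps,prefill_tps
q36-27b-nvfp4,llama-patched,8,85.3,915.1
q36-27b-nvfp4,llama-patched,32,211.9,919.0
q36-27b-nvfp4,llama-patched,64,305.2,923.5
q36-27b-nvfp4,llama-patched,128,382.1,922.9
q36-27b-nvfp4,vllm,8,70.4,2096.2
q36-27b-nvfp4,vllm,32,211.8,2182.6
q36-27b-nvfp4,vllm,64,309.1,2088.9
"""


def pct_of_vllm(csv_text, model):
    """Map npl -> patched decode throughput as a rounded percentage of vLLM."""
    decode = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["model"] == model:
            key = (row["engine"], int(row["npl"]))
            decode[key] = float(row["decode_agg_tps"])

    result = {}
    for (engine, npl), tps in decode.items():
        # Only report widths where a vLLM reference row exists.
        if engine == "llama-patched" and ("vllm", npl) in decode:
            result[npl] = round(100 * tps / decode[("vllm", npl)])
    return result


print(pct_of_vllm(CSV_TEXT, "q36-27b-nvfp4"))
```

Because the vLLM rows come from a different harness and a prior session, the resulting percentages carry the same "indicative" caveat as the README tables.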


@@ -854,27 +854,6 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
// If conversion fails, leave the per-slot cap unset (engine default)
}
}
// --- hybrid per-head bf16 SSM-state precision (patch 0026, qwen3.5 gated-DeltaNet decode) ---
// Opt-in reduced-precision fast mode for the recurrent SSM state: a gated-DeltaNet head whose
// memory length tau_h = 1/(|ssm_a|*softplus(ssm_dt)) tokens exceeds this threshold stays f32;
// faster-decaying heads persist their state as bf16, halving that head's dominant recurrence
// byte stream on decode. The value is the tau threshold in tokens (e.g. 32 / 64); 0 keeps every
// head f32 (the bit-exact default). Set BEFORE context init via LLAMA_SSM_BF16_TAU, consumed in
// common_context_params_to_llama (patch 0026) only when the --ssm-bf16-tau CLI flag is unset.
// Unset / non-positive => env untouched, so stock stays byte-identical and bit-exact (an
// externally exported LLAMA_SSM_BF16_TAU still works as an escape hatch). NOTE: this mode is
// NOT bit-exact (~91% same-top-p ceiling); see backend/cpp/llama-cpp-localai-paged/README.md (Dev notes).
} else if (!strcmp(optname, "ssm_bf16_tau") || !strcmp(optname, "ssm_hybrid_tau")) {
if (optval != NULL) {
try {
float tau = std::stof(optval_str);
if (tau > 0.0f) {
setenv("LLAMA_SSM_BF16_TAU", std::to_string(tau).c_str(), 1);
}
} catch (const std::exception& e) {
// If conversion fails, leave the threshold unset (bit-exact f32 default)
}
}
} else if (!strcmp(optname, "n_ctx_checkpoints") || !strcmp(optname, "ctx_checkpoints")) {
if (optval != NULL) {
try {


@@ -83,10 +83,7 @@
stock llama-cpp backend, with the LocalAI paged patch series applied
(vendored in this backend). Tuned for NVFP4 dense / MoE on Blackwell / GB10. Reuses the
llama-cpp gRPC server sources; the paged engine is gated at runtime by the
paged_kv / max_batch_tokens model options. Qwen3.5 gated-DeltaNet models can
additionally opt into the reduced-precision hybrid SSM-state fast mode with
the ssm_bf16_tau:<tokens> option (default off = bit-exact f32; non-bit-exact
when enabled).
paged_kv / max_batch_tokens model options.
urls:
- https://github.com/ggerganov/llama.cpp
tags:


@@ -125,7 +125,7 @@ For getting started, see the available backends in LocalAI here: https://github.
LocalAI supports various types of backends:
- **LLM Backends**: For running language models (e.g., llama.cpp, vLLM, SGLang, transformers, MLX)
- **`llama-cpp-localai-paged`**: LocalAI's paged-attention llama.cpp variant - on-demand paged KV cache plus a decode-first prefill budget, tuned for NVFP4 dense/MoE on Blackwell/GB10. Same upstream llama.cpp pin as the stock `llama-cpp` backend, reusing its gRPC server; the paged engine is enabled per-model via the `paged_kv` / `max_batch_tokens` options. For Qwen3.5 gated-DeltaNet (hybrid SSM) models you can additionally set `options: [ssm_bf16_tau:<tokens>]` to enable the reduced-precision hybrid SSM-state fast mode: fast-decaying recurrent heads (memory length tau below the threshold, e.g. `32` / `64`) persist their state as bf16, halving that head's decode byte stream. Default off (`0`) keeps every head f32 and is bit-exact; when enabled the mode is **not** bit-exact (~91% same-top-p ceiling - see `backend/cpp/llama-cpp-localai-paged/README.md` for the quality/throughput profile).
- **`llama-cpp-localai-paged`**: LocalAI's paged-attention llama.cpp variant - on-demand paged KV cache plus a decode-first prefill budget, tuned for NVFP4 dense/MoE on Blackwell/GB10. Same upstream llama.cpp pin as the stock `llama-cpp` backend, reusing its gRPC server; the paged engine is enabled per-model via the `paged_kv` / `max_batch_tokens` options.
- **Speech-to-Text Backends**: For transcription (e.g., whisper.cpp, parakeet.cpp, faster-whisper, NeMo)
- **Text-to-Speech Backends**: For speech synthesis (e.g., piper, Kokoro, VibeVoice, Qwen3-TTS)
- **Sound Generation Backends**: For music and audio generation (e.g., ACE-Step)


@@ -14,14 +14,11 @@
# GGUFs were re-quantized with a newer convert (origin/master) preserving the same
# MOSTLY_NVFP4 format; a load check on the paged backend GPU build is the final gate.
#
# NOTE(ssm_bf16_tau): Qwen3.5 gated-DeltaNet (hybrid SSM) models can opt into the
# reduced-precision hybrid SSM-state fast mode by adding `ssm_bf16_tau:<tokens>`
# (e.g. 32 / 64) to a model's `options:` list - fast-decaying recurrent heads then
# persist their state as bf16 (LLAMA_SSM_BF16_TAU), halving that head's decode byte
# stream. Default off (0) = every head f32 = bit-exact; when enabled the mode is NOT
# bit-exact (~91% same-top-p, beats vLLM dense) - see
# backend/cpp/llama-cpp-localai-paged/README.md for the quality profile.
# The two NVFP4 entries below intentionally stay bit-exact (no ssm_bf16_tau).
# The two NVFP4 entries below are bit-exact (f32 SSM state). The opt-in
# reduced-precision hybrid SSM-state lever (ssm_bf16_tau, patch 0026) was DROPPED:
# clean measurements showed it flat once the decode fusions landed (forcing all
# gated-DeltaNet heads to bf16 gave 780.6 vs 780.0 t/s, zero benefit) - see
# backend/cpp/llama-cpp-localai-paged/README.md section 5.
# =============================================================================
- name: "qwen3.6-27b-nvfp4-paged"
url: "github:mudler/LocalAI/gallery/virtual.yaml@master"