mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-29 19:06:43 -04:00
paged: drop bf16-tau (patch 0026), subsumed by decode fusions (tau=100000 flat, zero speed benefit)
The opt-in hybrid per-head bf16 SSM-state lever (ssm_bf16_tau, patch 0026) is removed from the llama-cpp-localai-paged patch series. Clean re-measurement after the decode fusions (0028 recurrent-state gather-fusion + 0029 block-table cache) landed shows it buys nothing: forcing ALL gated-DeltaNet heads to bf16 (tau=100000, the most aggressive setting) gives flat decode throughput, 780.6 vs 780.0 t/s. The mode engages but adds zero speed because it is subsumed by the fusions. The earlier "+12%" was measured before the fusions completed. bf16-tau was a precision trade (not bit-exact, ~91% same-top-p) plus extra bug surface and extra CUDA template-instantiation compile cost with no offsetting benefit.

Dependency check: no later patch (0028/0029/0030) depends on 0026. 0030's only mention is a description comment; its code keys off fused_gdn_ar/ch/auto_fgdn, which originate in 0018/0019/0021 (before 0026). The remaining series (0001-0025, 0028-0030) applies clean with git apply --check against the pin 0ed235ea2c17a19fc8238668653946721ed136fd. The Makefile applies the series by glob (patches/paged/0*.patch); the resulting gap at 0026 is tolerated (0005/0027 are already absent).

Removed:
- patches/paged/0026-qwen35-hybrid-perhead-ssm-state.patch
- the dead ssm_bf16_tau / ssm_hybrid_tau option handler in the shared grpc-server.cpp (it only set LLAMA_SSM_BF16_TAU, now a no-op the library no longer reads)
- the patched+bf16-tau benchmark columns and llama-patched-bf16tau rows (README + final_benchmark.csv), the ssm_bf16_tau option text in backend index.yaml, the gallery NOTE block, and the docs/features/backends.md mention.

The rejected-lever lesson is kept (why it was dropped: subsumed, tau=100000 flat) in the backend README section 5, the paged-backend agent guide, and the vLLM-parity methodology, so it is not re-tried.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
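The commit message notes that the series is applied by glob and that numbering gaps (0005, 0026, 0027) are tolerated. A minimal sketch of how such gaps could be listed, assuming only the four-digit filename prefix convention; the file names generated below are placeholders, not the real patch names, and `series_gaps` is a hypothetical helper rather than anything in the Makefile:

```python
import re


def series_gaps(patch_names):
    """Return (sorted patch numbers, numbers missing between first and last)."""
    numbers = sorted(
        int(m.group(1))
        for name in patch_names
        if (m := re.match(r"(\d{4})-.*\.patch$", name))
    )
    if not numbers:
        return [], []
    present = set(numbers)
    missing = [n for n in range(numbers[0], numbers[-1] + 1) if n not in present]
    return numbers, missing


# Placeholder names: 0001-0030 with 0005, 0026 and 0027 absent, as the
# commit message describes. Only the numeric prefix matters here.
names = [f"{n:04d}-example.patch" for n in range(1, 31) if n not in (5, 26, 27)]
numbers, missing = series_gaps(names)
print(missing)
```

A glob such as `0*.patch` simply yields whatever files exist in sorted order, so a gap changes nothing about how the remaining patches are applied; a listing like this only makes the gaps visible for review.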
@@ -42,8 +42,11 @@ how-to.
 dev doc or `*.md`. Strict `git apply` on a clean checkout must reach exit 0. (A
 stray `SSM_DECODE_FIX_RESULTS.md` hunk in patch 0019 once broke the CI build.)
 - **Bit-exact by default.** Every shipped patch is byte-identical to the f32
-baseline. The one opt-in precision trade (`ssm_bf16_tau`, patch 0026) defaults
-off; never put it in a recommended/gallery config.
+baseline. (The one opt-in precision trade, `ssm_bf16_tau` / patch 0026, was
+DROPPED: it went flat once the decode fusions landed - forcing all gated-DeltaNet
+heads to bf16 gave 780.6 vs 780.0 t/s, zero benefit - so the series is now
+bit-exact end to end. Do not reintroduce a per-head SSM-precision lever; see the
+rejected-levers note in the backend README section 5.)
 
 ## Maintaining the pin against new llama.cpp
 
@@ -54,9 +54,15 @@ backend README.
 the runtime flag off (compiled-in wins survive the toggle). Cross-engine "% of vLLM"
 (batched-bench vs vLLM server+client) is *indicative*; always caveat the harness
 and config (context length alone shifted the MoE figure 76% <-> 86%).
-- **The win may be a precision trade, not a free lever.** bf16 SSM state was +12%
-but failed the f32 KL gate (vLLM keeps f32 too), so it ships default-off opt-in -
-never in a recommended config.
+- **Re-measure a "win" after later levers land - it may evaporate.** bf16 SSM
+state (the `ssm_bf16_tau` lever) benched +12% early and failed the f32 KL gate
+(vLLM keeps f32 too), so it was kept default-off opt-in. Once the decode fusions
+(recurrent-state gather-fusion + block-table cache) landed, a clean re-measure
+forcing ALL gated-DeltaNet heads to bf16 (`tau=100000`) went **flat** - 780.6 vs
+780.0 t/s. The "+12%" was subsumed by the fusions: the lever bought nothing, so
+it was **dropped** (precision trade + bug surface + extra CUDA template-instantiation
+compile cost, zero benefit). A win measured before the rest of the series is not a
+win after it.
 - **Reject the obvious-but-wrong, with evidence.** A faster kernel that is off the
 critical path benches FLAT (the freed time becomes idle). Quantizing the bf16
 projections to NVFP4 cost ~6% PPL - and vLLM keeps them bf16 for the same reason.
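The "re-measure a win after later levers land" rule in the hunk above comes down to comparing two throughput figures against a noise floor. A minimal sketch of that comparison, using the 780.6 vs 780.0 t/s figures from the text and, for contrast, the pre-fusion dense npl=128 pair (446.1 vs 382.1 t/s) from the removed benchmark columns; the 1% noise floor and both function names are assumptions made for illustration, not values or helpers from the repository:

```python
def relative_gain(candidate_tps, baseline_tps):
    """Fractional throughput change of candidate over baseline."""
    return (candidate_tps - baseline_tps) / baseline_tps


def is_flat(candidate_tps, baseline_tps, noise_floor=0.01):
    """True when the change is smaller than the assumed run-to-run noise.

    noise_floor=0.01 (1%) is a hypothetical threshold; pick one from the
    measured variance of the actual harness.
    """
    return abs(relative_gain(candidate_tps, baseline_tps)) < noise_floor


# Post-fusion re-measurement quoted in the text: about 0.08%, i.e. flat.
print(f"{relative_gain(780.6, 780.0):.2%}", is_flat(780.6, 780.0))

# Pre-fusion dense npl=128 pair from the removed columns: about 17%, not flat.
print(f"{relative_gain(446.1, 382.1):.2%}", is_flat(446.1, 382.1))
```

The point of the check is ordering: the same lever passes it before the fusions and fails it after, which is why the measurement has to be repeated against the full series.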
@@ -142,14 +142,22 @@ These are the dominant decode levers on the Qwen3.6 hybrid models. All bit-exact
 | 0023 | **NVFP4 activation-quantize de-dup** - the broadcast up/gate projections re-quantize the same token activation once per expert; quantize the unique token activations once and byte-copy them into the expert-gathered layout. The only NVFP4-specific patch. | yes (byte-identical) |
 | 0025 | **MoE decode re-graph** - keep CUDA graphs on for the grouped-MMQ MoE decode step (the upstream guard disables graphs conservatively; the grouped path has no host sync). Env-gated `LLAMA_MOE_FORCE_GRAPHS`. | yes (graph replay re-issues identical kernels) |
 
-### Pool reclaim, block-table cache, backend gate, opt-in bf16-SSM
+### Pool reclaim, block-table cache, backend gate
 
 | # | What it does | Bit-exact |
 |---|---|---|
 | 0024 | **Paged-pool burst-reclaim** - truncate trailing blocks on partial-tail `seq_rm`, defrag the free queue when idle, release blocks on slot completion. Fixes the long-server burst-degradation bug (post-burst prefill collapse 488->44 t/s, restored to 532). Host-side accounting only. | yes |
 | 0029 | **Block-table within-step host cache** - the block table is fixed for the whole step; cache it on first build and memcpy it for the other full-attention layers (get_block_table -87%/-91%). | yes, per path (paged-MoE ref `8cb0ce23`) |
 | 0030 | **Fused-op backend gate** - the fused GDN / discriminated SSM_CONV ops are CUDA-family + CPU only; force them off on any non-CUDA compute backend so a Vulkan/SYCL/Metal build can't silently run the wrong plain-conv kernel. | yes on CUDA (byte-identical pre-0030); safety gate elsewhere |
-| 0026 | **Hybrid per-head bf16 SSM state (opt-in)** - `--ssm-bf16-tau` / option `ssm_bf16_tau`: fast-decaying GDN heads (memory length below the tau threshold) persist state as bf16, halving that head's decode byte stream (~+12% decode). | default tau=0 = f32 = **bit-exact**; the bf16 mode is **NOT** bit-exact (~91% same-top-p) |
 
+> **Dropped: patch 0026 (hybrid per-head bf16 SSM state, `ssm_bf16_tau`).** Once
+> the decode fusions (0028 recurrent-state gather-fusion + 0029 block-table cache)
+> landed, the bf16-SSM lever bought nothing: a clean re-measurement forcing **all**
+> gated-DeltaNet heads to bf16 (`tau=100000`) gives **flat** decode (780.6 vs
+> 780.0 t/s) - the mode engages but adds zero throughput because it is subsumed by
+> the fusions. It was a precision trade (not bit-exact) plus extra bug surface and
+> CUDA template-instantiation compile cost with no benefit, so it was removed. See
+> section 5 ("rejected / flat levers") for the full record.
+
 ---
 
@@ -164,22 +172,27 @@ swept over serving width `npl` in {8, 32, 64, 128}. Plots:
 [`qwen36_moe_decode_vs_npl.png`](docs/qwen36_moe_decode_vs_npl.png); raw data
 [`final_benchmark.csv`](docs/final_benchmark.csv).
 
-[plot image line; markup lost in extraction]
+[plot image line; markup lost in extraction]
 
-> **What was re-measured (2026-06-27).** The three llama columns - **stock**,
-> **patched**, and **patched+bf16-tau** - were all re-measured this session on one
-> consistent `llama-batched-bench` harness. The **vLLM** column is the
-> **prior-session reference** (kept as-is, *not* re-run this session). Per-run peak
+> The plot above also shows a third "bf16-tau" llama curve. That was the opt-in
+> `ssm_bf16_tau` lever (patch 0026), since **dropped** - a clean re-measurement
+> showed it flat once the decode fusions landed (see section 5). The numbers below
+> use only **stock** vs **patched** vs **vLLM**.
+
+> **What was re-measured (2026-06-27).** The two llama columns - **stock** and
+> **patched** - were re-measured this session on one consistent
+> `llama-batched-bench` harness. The **vLLM** column is the **prior-session
+> reference** (kept as-is, *not* re-run this session). Per-run peak
 > VRAM was *not* re-captured: the GB10's unified Grace-Blackwell LPDDR5x reports
 > `[N/A]` to `nvidia-smi --query-gpu=memory.used` and the bench does not print it
 > (the memory-advantage note below is the prior-session finding).
 
-### (a) + (b) Patched vs stock vs patched+bf16-tau vs vLLM
+### (a) + (b) Patched vs stock vs vLLM
 
 The **stock** column is a separate, unpatched llama.cpp built at this backend's
-**exact pin (`9d5d882d`)**; the **patched** and **patched+bf16-tau** columns are
+**exact pin (`9d5d882d`)**; the **patched** column is
 the paged binary, env/flag-toggled (`LLAMA_KV_PAGED=1`, plus
-`LLAMA_MOE_FORCE_GRAPHS=1` for MoE; bf16-tau adds `--ssm-bf16-tau 64`). All three
+`LLAMA_MOE_FORCE_GRAPHS=1` for MoE). Both
 run on the **same harness**, so "x over stock" is an apples-to-apples measure of
 the patch series. (Note: the patch series' dominant SSM decode fusions are
 compiled in, not env-gated - toggling `LLAMA_KV_PAGED` alone on the *patched*
@@ -190,36 +203,26 @@ cross-engine "% of vLLM" is **indicative, not apples-to-apples**.
 
 **Dense Qwen3.6-27B-NVFP4** (decode t/s):
 
-| npl | stock | patched | patched+bf16-tau | vLLM (prior) | patched x over stock | bf16-tau over patched |
-|----:|------:|--------:|-----------------:|-------------:|---------------------:|----------------------:|
-| 8 | 68.3 | 85.3 | 87.8 | 70.4 | 1.25x | +3% |
-| 32 | 119.9 | 211.9 | 231.0 | 211.8 | 1.77x | +9% |
-| 64 | 142.8 | 305.2 | 341.4 | 309.1 | 2.14x | +12% |
-| 128 | 155.1 | 382.1 | 446.1 | 418.8 | 2.46x | +17% |
+| npl | stock | patched | vLLM (prior) | patched x over stock |
+|----:|------:|--------:|-------------:|---------------------:|
+| 8 | 68.3 | 85.3 | 70.4 | 1.25x |
+| 32 | 119.9 | 211.9 | 211.8 | 1.77x |
+| 64 | 142.8 | 305.2 | 309.1 | 2.14x |
+| 128 | 155.1 | 382.1 | 418.8 | 2.46x |
 
 Dense **patched** is parity-to-ahead of vLLM (121 / 100 / 99 / 91% of vLLM across
-the widths); **patched+bf16-tau** is **ahead of vLLM at every width** (125 / 109 /
-110 / 107%).
+the widths).
 
 **MoE Qwen3.6-35B-A3B-NVFP4** (decode t/s):
 
-| npl | stock | patched | patched+bf16-tau | vLLM (prior) | patched x over stock | bf16-tau over patched |
-|----:|------:|--------:|-----------------:|-------------:|---------------------:|----------------------:|
-| 8 | 186.7 | 230.3 | 240.5 | 256.5 | 1.23x | +4% |
-| 32 | 267.4 | 466.4 | 508.1 | 500.8 | 1.74x | +9% |
-| 64 | 320.5 | 622.4 | 703.8 | 686.1 | 1.94x | +13% |
-| 128 | 347.2 | 784.3 | 918.0 | 882.2 | 2.26x | +17% |
+| npl | stock | patched | vLLM (prior) | patched x over stock |
+|----:|------:|--------:|-------------:|---------------------:|
+| 8 | 186.7 | 230.3 | 256.5 | 1.23x |
+| 32 | 267.4 | 466.4 | 500.8 | 1.74x |
+| 64 | 320.5 | 622.4 | 686.1 | 1.94x |
+| 128 | 347.2 | 784.3 | 882.2 | 2.26x |
 
-MoE **patched** is 90 / 93 / 91 / 89% of vLLM; **patched+bf16-tau** reaches
-parity-to-ahead (94 / 101 / 103 / 104%) at npl>=32.
-
-**On bf16-tau.** The `patched+bf16-tau` column uses `--ssm-bf16-tau 64` (no exact
-tau was recorded behind the documented "~+12%"; the flag help suggests 32/64, and
-64 bf16's more of the fast-decaying GDN heads). It is **opt-in and NOT bit-exact**
-(~91% same-top-p; see section 5) - it persists fast-decaying GDN head state as
-bf16 to halve that head's recurrence byte stream. Measured decode gain over
-patched grows with serving width: **+3-4% at npl8, ~+12-13% at npl64, +17% at
-npl128** (dense and MoE alike).
+MoE **patched** is 90 / 93 / 91 / 89% of vLLM.
 
 **Caveat on the vLLM column.** It is a **different harness** and a
 **prior-session** measurement (not re-run this session), so the cross-engine "% of
@@ -229,10 +232,8 @@ vLLM" is **indicative, not apples-to-apples**. Memory (prior session): llama use
 **Takeaway.** Re-measured this session, the patch series gives up to **2.46x
 (dense) / 2.26x (MoE)** over true-stock `9d5d882d` on the same harness (close to,
 slightly below, the prior 2.59x / 2.33x - llama was re-measured, vLLM kept).
-Opt-in **bf16-tau adds a further +3% to +17%** on top of patched (growing with
-width). Dense is **ahead of vLLM** with bf16-tau at every width; MoE **patched**
-sits at ~89-93% of the prior-session vLLM and **bf16-tau reaches parity-to-ahead**
-at npl>=32. The residual non-bf16 MoE gap is structural (see section 5).
+Dense is parity-to-ahead of vLLM; MoE **patched** sits at ~89-93% of the
+prior-session vLLM. The residual MoE gap is structural (see section 5).
 
 ### (c) Apple Silicon (M4, 16GB Metal) - does the patchset help here?
 
@@ -314,14 +315,20 @@ llama is losing. The MoE GEMM kernel is *not* where the gap lives.
 (The "the win was NVFP4-dense-quant, not the Marlin kernel" dense verdict
 carries over to MoE.)
 
-**Opt-in bf16-SSM fast mode** (patch 0026, `ssm_bf16_tau`). The design premise -
-that bf16 KL error concentrates in long-memory heads and can be removed by
-keeping them f32 - is **empirically refuted**: the error scales with the bf16
+**Opt-in bf16-SSM fast mode - DROPPED (was patch 0026, `ssm_bf16_tau`).** The
+design premise - that bf16 KL error concentrates in long-memory heads and can be
+removed by keeping them f32 - was already shaky: the error scales with the bf16
 head *count* and saturates (~0.06 MeanKLD / ~91% same-top-p) far below any useful
-byte saving, and the carry is byte-exact (genuine bf16 rounding, not a bug). The
-byte-saving (and ~+12% decode) is real but cannot meet a strict KL bar, so it
-ships **default-off (f32, bit-exact)** and opt-in only. Do not put a hybrid tau
-in a recommended/gallery config.
+byte saving. The lever was then **removed entirely** once the decode fusions
+(0028 recurrent-state gather-fusion + 0029 block-table cache) landed: a clean
+re-measurement that forced **all** gated-DeltaNet heads to bf16 (`tau=100000`,
+the most aggressive setting) gave **flat** decode throughput - **780.6 vs 780.0
+t/s**. The mode engages but buys **zero** speed; the earlier "+12%" was subsumed
+by the fusions. So bf16-tau was a precision trade (not bit-exact) plus extra bug
+surface and CUDA template-instantiation compile cost with **no** offsetting
+benefit, and patch 0026 was dropped from the series. Lesson recorded so it is not
+re-tried: do not reintroduce a per-head SSM-precision lever - the bandwidth it
+targeted is already recovered by the gather-fusion + block-table cache.
 
 ---
 
@@ -403,6 +410,6 @@ The benchmarked NVFP4 GGUFs are published and wired into the LocalAI gallery:
 
 Both gallery entries set `backend: llama-cpp-localai-paged` and the paged serving config
 (`paged_kv:true`, `max_batch_tokens`, `kv_unified:false`, `parallel`,
-`flash_attention:on`, `context_size`). They intentionally stay bit-exact (no
-`ssm_bf16_tau`). The full backend-split + gallery plan is in
+`flash_attention:on`, `context_size`). They are bit-exact. The full
+backend-split + gallery plan is in
 [`LOCALAI_LLAMACPP_BACKEND_PLAN.md`](docs/LOCALAI_LLAMACPP_BACKEND_PLAN.md).
@@ -7,10 +7,6 @@ q36-27b-nvfp4,llama-patched,8,85.3,915.1
 q36-27b-nvfp4,llama-patched,32,211.9,919.0
 q36-27b-nvfp4,llama-patched,64,305.2,923.5
 q36-27b-nvfp4,llama-patched,128,382.1,922.9
-q36-27b-nvfp4,llama-patched-bf16tau,8,87.8,919.2
-q36-27b-nvfp4,llama-patched-bf16tau,32,231.0,931.1
-q36-27b-nvfp4,llama-patched-bf16tau,64,341.4,930.7
-q36-27b-nvfp4,llama-patched-bf16tau,128,446.1,932.2
 q36-27b-nvfp4,vllm,8,70.4,2096.2
 q36-27b-nvfp4,vllm,32,211.8,2182.6
 q36-27b-nvfp4,vllm,64,309.1,2088.9
@@ -23,10 +19,6 @@ q36-35b-a3b-nvfp4,llama-patched,8,230.3,1510.3
 q36-35b-a3b-nvfp4,llama-patched,32,466.4,1969.2
 q36-35b-a3b-nvfp4,llama-patched,64,622.4,2122.8
 q36-35b-a3b-nvfp4,llama-patched,128,784.3,2177.0
-q36-35b-a3b-nvfp4,llama-patched-bf16tau,8,240.5,1539.8
-q36-35b-a3b-nvfp4,llama-patched-bf16tau,32,508.1,2031.7
-q36-35b-a3b-nvfp4,llama-patched-bf16tau,64,703.8,2151.8
-q36-35b-a3b-nvfp4,llama-patched-bf16tau,128,918.0,2212.3
 q36-35b-a3b-nvfp4,vllm,8,256.5,5186.5
 q36-35b-a3b-nvfp4,vllm,32,500.8,6223.4
 q36-35b-a3b-nvfp4,vllm,64,686.1,5926.5
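The "% of vLLM" figures quoted in the README hunks can be re-derived from the surviving CSV rows. A minimal sketch, assuming the fourth column is decode t/s (it matches the README tables) and using hypothetical column names, since the CSV header row is not visible in this diff; the fifth column is left unlabeled because the excerpt does not say what it measures. Only widths with both a `llama-patched` and a `vllm` row in the excerpt are compared:

```python
import csv
import io

# Rows copied from the two final_benchmark.csv hunks (kept rows only).
ROWS = """\
q36-27b-nvfp4,llama-patched,8,85.3,915.1
q36-27b-nvfp4,llama-patched,32,211.9,919.0
q36-27b-nvfp4,llama-patched,64,305.2,923.5
q36-27b-nvfp4,llama-patched,128,382.1,922.9
q36-27b-nvfp4,vllm,8,70.4,2096.2
q36-27b-nvfp4,vllm,32,211.8,2182.6
q36-27b-nvfp4,vllm,64,309.1,2088.9
q36-35b-a3b-nvfp4,llama-patched,8,230.3,1510.3
q36-35b-a3b-nvfp4,llama-patched,32,466.4,1969.2
q36-35b-a3b-nvfp4,llama-patched,64,622.4,2122.8
q36-35b-a3b-nvfp4,llama-patched,128,784.3,2177.0
q36-35b-a3b-nvfp4,vllm,8,256.5,5186.5
q36-35b-a3b-nvfp4,vllm,32,500.8,6223.4
q36-35b-a3b-nvfp4,vllm,64,686.1,5926.5
"""

# Assumed column names; "unlabeled" is a placeholder for the fifth column.
FIELDS = ["model", "engine", "npl", "decode_tps", "unlabeled"]


def percent_of_vllm(text):
    """Map (model, npl) -> patched decode t/s as a rounded percent of vLLM."""
    table = {}
    for row in csv.DictReader(io.StringIO(text), fieldnames=FIELDS):
        key = (row["model"], row["engine"], int(row["npl"]))
        table[key] = float(row["decode_tps"])
    result = {}
    for (model, engine, npl), tps in table.items():
        if engine != "llama-patched":
            continue
        vllm_tps = table.get((model, "vllm", npl))
        if vllm_tps is not None:  # skip widths whose vLLM row is not in the excerpt
            result[(model, npl)] = round(100 * tps / vllm_tps)
    return result


for (model, npl), pct in sorted(percent_of_vllm(ROWS).items()):
    print(f"{model} npl={npl}: {pct}% of vLLM")
```

The README's caveat still applies to any number produced this way: the vLLM rows come from a different harness and a prior session, so the percentages are indicative rather than apples-to-apples.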
File diff suppressed because it is too large
@@ -854,27 +854,6 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
 // If conversion fails, leave the per-slot cap unset (engine default)
 }
 }
-// --- hybrid per-head bf16 SSM-state precision (patch 0026, qwen3.5 gated-DeltaNet decode) ---
-// Opt-in reduced-precision fast mode for the recurrent SSM state: a gated-DeltaNet head whose
-// memory length tau_h = 1/(|ssm_a|*softplus(ssm_dt)) tokens exceeds this threshold stays f32;
-// faster-decaying heads persist their state as bf16, halving that head's dominant recurrence
-// byte stream on decode. The value is the tau threshold in tokens (e.g. 32 / 64); 0 keeps every
-// head f32 (the bit-exact default). Set BEFORE context init via LLAMA_SSM_BF16_TAU, consumed in
-// common_context_params_to_llama (patch 0026) only when the --ssm-bf16-tau CLI flag is unset.
-// Unset / non-positive => env untouched, so stock stays byte-identical and bit-exact (an
-// externally exported LLAMA_SSM_BF16_TAU still works as an escape hatch). NOTE: this mode is
-// NOT bit-exact (~91% same-top-p ceiling); see backend/cpp/llama-cpp-localai-paged/README.md (Dev notes).
-} else if (!strcmp(optname, "ssm_bf16_tau") || !strcmp(optname, "ssm_hybrid_tau")) {
-if (optval != NULL) {
-try {
-float tau = std::stof(optval_str);
-if (tau > 0.0f) {
-setenv("LLAMA_SSM_BF16_TAU", std::to_string(tau).c_str(), 1);
-}
-} catch (const std::exception& e) {
-// If conversion fails, leave the threshold unset (bit-exact f32 default)
-}
-}
 } else if (!strcmp(optname, "n_ctx_checkpoints") || !strcmp(optname, "ctx_checkpoints")) {
 if (optval != NULL) {
 try {
@@ -83,10 +83,7 @@
 stock llama-cpp backend, with the LocalAI paged patch series applied
 (vendored in this backend). Tuned for NVFP4 dense / MoE on Blackwell / GB10. Reuses the
 llama-cpp gRPC server sources; the paged engine is gated at runtime by the
-paged_kv / max_batch_tokens model options. Qwen3.5 gated-DeltaNet models can
-additionally opt into the reduced-precision hybrid SSM-state fast mode with
-the ssm_bf16_tau:<tokens> option (default off = bit-exact f32; non-bit-exact
-when enabled).
+paged_kv / max_batch_tokens model options.
 urls:
 - https://github.com/ggerganov/llama.cpp
 tags:
@@ -125,7 +125,7 @@ For getting started, see the available backends in LocalAI here: https://github.
 LocalAI supports various types of backends:
 
 - **LLM Backends**: For running language models (e.g., llama.cpp, vLLM, SGLang, transformers, MLX)
-- **`llama-cpp-localai-paged`**: LocalAI's paged-attention llama.cpp variant - on-demand paged KV cache plus a decode-first prefill budget, tuned for NVFP4 dense/MoE on Blackwell/GB10. Same upstream llama.cpp pin as the stock `llama-cpp` backend, reusing its gRPC server; the paged engine is enabled per-model via the `paged_kv` / `max_batch_tokens` options. For Qwen3.5 gated-DeltaNet (hybrid SSM) models you can additionally set `options: [ssm_bf16_tau:<tokens>]` to enable the reduced-precision hybrid SSM-state fast mode: fast-decaying recurrent heads (memory length tau below the threshold, e.g. `32` / `64`) persist their state as bf16, halving that head's decode byte stream. Default off (`0`) keeps every head f32 and is bit-exact; when enabled the mode is **not** bit-exact (~91% same-top-p ceiling - see `backend/cpp/llama-cpp-localai-paged/README.md` for the quality/throughput profile).
+- **`llama-cpp-localai-paged`**: LocalAI's paged-attention llama.cpp variant - on-demand paged KV cache plus a decode-first prefill budget, tuned for NVFP4 dense/MoE on Blackwell/GB10. Same upstream llama.cpp pin as the stock `llama-cpp` backend, reusing its gRPC server; the paged engine is enabled per-model via the `paged_kv` / `max_batch_tokens` options.
 - **Speech-to-Text Backends**: For transcription (e.g., whisper.cpp, parakeet.cpp, faster-whisper, NeMo)
 - **Text-to-Speech Backends**: For speech synthesis (e.g., piper, Kokoro, VibeVoice, Qwen3-TTS)
 - **Sound Generation Backends**: For music and audio generation (e.g., ACE-Step)
@@ -14,14 +14,11 @@
 # GGUFs were re-quantized with a newer convert (origin/master) preserving the same
 # MOSTLY_NVFP4 format; a load check on the paged backend GPU build is the final gate.
 #
-# NOTE(ssm_bf16_tau): Qwen3.5 gated-DeltaNet (hybrid SSM) models can opt into the
-# reduced-precision hybrid SSM-state fast mode by adding `ssm_bf16_tau:<tokens>`
-# (e.g. 32 / 64) to a model's `options:` list - fast-decaying recurrent heads then
-# persist their state as bf16 (LLAMA_SSM_BF16_TAU), halving that head's decode byte
-# stream. Default off (0) = every head f32 = bit-exact; when enabled the mode is NOT
-# bit-exact (~91% same-top-p, beats vLLM dense) - see
-# backend/cpp/llama-cpp-localai-paged/README.md for the quality profile.
-# The two NVFP4 entries below intentionally stay bit-exact (no ssm_bf16_tau).
+# The two NVFP4 entries below are bit-exact (f32 SSM state). The opt-in
+# reduced-precision hybrid SSM-state lever (ssm_bf16_tau, patch 0026) was DROPPED:
+# clean measurements showed it flat once the decode fusions landed (forcing all
+# gated-DeltaNet heads to bf16 gave 780.6 vs 780.0 t/s, zero benefit) - see
+# backend/cpp/llama-cpp-localai-paged/README.md section 5.
 # =============================================================================
 - name: "qwen3.6-27b-nvfp4-paged"
 url: "github:mudler/LocalAI/gallery/virtual.yaml@master"