mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-27 09:57:14 -04:00
feat(paged): wire ssm_bf16_tau model option for hybrid SSM-state fast mode

Patch 0026 added the hybrid per-head bf16 SSM-state opt-in as the ssm_hybrid_tau_thresh cparam + the --ssm-bf16-tau CLI flag (default 0 = bit-exact f32). Expose it per-model via the LocalAI gallery/model YAML `options:` list, mirroring the paged_kv / max_batch_tokens setenv hooks.

- grpc-server.cpp: new `ssm_bf16_tau` (alias `ssm_hybrid_tau`) option -> setenv(LLAMA_SSM_BF16_TAU) when the value parses to a positive float. It does NOT reference the paged-only common_params field, so the turboquant fork (which lacks patch 0026) stays byte-clean.
- patch 0026 (common.cpp common_context_params_to_llama): getenv fallback feeds cparams.ssm_hybrid_tau_thresh from LLAMA_SSM_BF16_TAU only when the --ssm-bf16-tau CLI flag is unset (0). Absent/non-positive env => untouched, so stock stays bit-exact; the CLI flag takes precedence when set.
- docs: backend/index.yaml note, docs backends.md, gallery header NOTE (referencing A_HYBRID_SSM_RESULTS.md; the 2 NVFP4 entries stay bit-exact).

Byte-safe when unset: with no ssm_bf16_tau option the env is never touched and the default f32 bit-exact recurrence is preserved.

Verified the parse + consume code paths with a standalone compile-and-run (option string -> LLAMA_SSM_BF16_TAU -> tau, plus 0 / garbage / CLI-precedence / unset cases).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
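The option string -> LLAMA_SSM_BF16_TAU -> tau path described in the message can be sketched as a standalone program fragment. This is an illustrative reconstruction, not the verification code used for the commit: `apply_model_option` and `resolve_tau` are hypothetical names, the "name:value" split is an assumption about how one `options:` entry is presented, and `setenv` assumes a POSIX platform.

```cpp
#include <cassert>
#include <cstdlib>    // std::getenv
#include <stdlib.h>   // setenv / unsetenv (POSIX)
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the grpc-server side: takes one "name:value"
// entry from the model YAML `options:` list and exports LLAMA_SSM_BF16_TAU
// only when the value parses to a positive float.
inline void apply_model_option(const std::string & opt) {
    const auto colon = opt.find(':');
    if (colon == std::string::npos) return;
    const std::string name  = opt.substr(0, colon);
    const std::string value = opt.substr(colon + 1);
    if (name != "ssm_bf16_tau" && name != "ssm_hybrid_tau") return;
    try {
        const float tau = std::stof(value);
        if (tau > 0.0f) {
            setenv("LLAMA_SSM_BF16_TAU", std::to_string(tau).c_str(), 1);
        }
    } catch (const std::exception &) {
        // Unparsable value: leave the environment untouched (f32 default).
    }
}

// Hypothetical stand-in for the consume side: cli_tau models the value of
// the --ssm-bf16-tau flag, where 0 means "flag not set".
inline float resolve_tau(float cli_tau) {
    float tau = cli_tau;
    if (tau == 0.0f) {
        if (const char * env = std::getenv("LLAMA_SSM_BF16_TAU")) {
            try { tau = std::stof(env); } catch (...) {}
        }
    }
    return tau;
}
```

With the variable unset, `resolve_tau(0.0f)` stays 0; after `apply_model_option("ssm_hybrid_tau:64")` it returns 64, while `resolve_tau(32.0f)` keeps returning 32 because a non-zero CLI value short-circuits the environment lookup.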
@@ -840,6 +840,27 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
                     // If conversion fails, leave the per-slot cap unset (engine default)
                 }
             }
+        // --- hybrid per-head bf16 SSM-state precision (patch 0026, qwen3.5 gated-DeltaNet decode) ---
+        // Opt-in reduced-precision fast mode for the recurrent SSM state: a gated-DeltaNet head whose
+        // memory length tau_h = 1/(|ssm_a|*softplus(ssm_dt)) tokens exceeds this threshold stays f32;
+        // faster-decaying heads persist their state as bf16, halving that head's dominant recurrence
+        // byte stream on decode. The value is the tau threshold in tokens (e.g. 32 / 64); 0 keeps every
+        // head f32 (the bit-exact default). Set BEFORE context init via LLAMA_SSM_BF16_TAU, consumed in
+        // common_context_params_to_llama (patch 0026) only when the --ssm-bf16-tau CLI flag is unset.
+        // Unset / non-positive => env untouched, so stock stays byte-identical and bit-exact (an
+        // externally exported LLAMA_SSM_BF16_TAU still works as an escape hatch). NOTE: this mode is
+        // NOT bit-exact (~91% same-top-p ceiling); see patches/paged/A_HYBRID_SSM_RESULTS.md.
+        } else if (!strcmp(optname, "ssm_bf16_tau") || !strcmp(optname, "ssm_hybrid_tau")) {
+            if (optval != NULL) {
+                try {
+                    float tau = std::stof(optval_str);
+                    if (tau > 0.0f) {
+                        setenv("LLAMA_SSM_BF16_TAU", std::to_string(tau).c_str(), 1);
+                    }
+                } catch (const std::exception& e) {
+                    // If conversion fails, leave the threshold unset (bit-exact f32 default)
+                }
+            }
         } else if (!strcmp(optname, "n_ctx_checkpoints") || !strcmp(optname, "ctx_checkpoints")) {
             if (optval != NULL) {
                 try {
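The per-head rule stated in the comment above can be written out as a small sketch. It only illustrates the formula tau_h = 1/(|ssm_a|*softplus(ssm_dt)) and the "longer memory than the threshold stays f32" decision; the function names are hypothetical, and the bf16 conversion shown is plain truncation of the low 16 bits, which is an assumption and may differ from the rounding the actual kernels use.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// softplus(x) = ln(1 + e^x); for large x it is numerically equal to x.
inline float softplus(float x) {
    return x > 20.0f ? x : std::log1p(std::exp(x));
}

// Memory length of one head in tokens, as given in the comment:
// tau_h = 1 / (|ssm_a| * softplus(ssm_dt)).
inline float head_memory_length(float ssm_a, float ssm_dt) {
    return 1.0f / (std::fabs(ssm_a) * softplus(ssm_dt));
}

// true  => the head keeps its state in f32,
// false => the head persists its state as bf16.
// A threshold of 0 (the default) keeps every head in f32.
inline bool head_stays_f32(float tau_h, float tau_thresh) {
    return tau_thresh <= 0.0f || tau_h > tau_thresh;
}

// f32 -> bf16 by dropping the low 16 bits (truncation, assumed here for
// simplicity): 2 bytes per element instead of 4.
inline uint16_t f32_to_bf16_trunc(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return static_cast<uint16_t>(u >> 16);
}

// bf16 -> f32 by restoring the dropped bits as zeros.
inline float bf16_to_f32(uint16_t h) {
    const uint32_t u = static_cast<uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}
```

With a threshold of 64, a head with tau_h = 100 stays f32 and a head with tau_h = 10 is stored as bf16. The round trip through bf16 keeps only 7 explicit mantissa bits, which is one way to see why the mode cannot be bit-exact.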
@@ -54,13 +54,23 @@ diff --git a/common/common.cpp b/common/common.cpp
 index a14e7bb..c4ab884 100644
 --- a/common/common.cpp
 +++ b/common/common.cpp
-@@ -1600,6 +1600,9 @@ struct llama_context_params common_context_params_to_llama(const common_params &
+@@ -1600,6 +1600,19 @@ struct llama_context_params common_context_params_to_llama(const common_params &
  
      cparams.type_k = params.cache_type_k;
      cparams.type_v = params.cache_type_v;
 +    cparams.type_r = params.cache_type_conv;
 +    cparams.type_s = params.cache_type_ssm;
 +    cparams.ssm_hybrid_tau_thresh = params.ssm_hybrid_tau_thresh;
++    // LocalAI per-model option hook: when the --ssm-bf16-tau CLI flag is at its bit-exact
++    // default (0), honor LLAMA_SSM_BF16_TAU (set by the grpc-server from the model YAML
++    // `options: [ssm_bf16_tau:N]`) so the reduced-precision hybrid fast mode is selectable
++    // per model without a process-wide CLI flag. Absent/non-positive env => untouched, so
++    // stock stays bit-exact; the CLI flag, when set, takes precedence.
++    if (cparams.ssm_hybrid_tau_thresh == 0.0f) {
++        if (const char * tau_env = std::getenv("LLAMA_SSM_BF16_TAU")) {
++            try { cparams.ssm_hybrid_tau_thresh = std::stof(tau_env); } catch (...) {}
++        }
++    }
  
      return cparams;
  }
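One detail of the fallback above is worth illustrating: `std::stof` parses the longest valid numeric prefix, so "64abc" yields 64, and the hunk assigns whatever `std::stof` returns without a sign check. The grpc-server side only exports positive values, so this matters only for an externally exported variable; how a negative threshold is treated downstream is not shown here. A stricter parse could look like the sketch below, where `parse_positive_tau` is a hypothetical helper that is not part of the patch and `std::optional` assumes C++17.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical helper: accept a string only when all of it is consumed
// as a finite, strictly positive float.
inline std::optional<float> parse_positive_tau(const std::string & s) {
    try {
        std::size_t used = 0;
        const float v = std::stof(s, &used);
        if (used != s.size()) return std::nullopt;          // trailing junk, e.g. "64abc"
        if (!std::isfinite(v) || v <= 0.0f) return std::nullopt;  // nan, inf, zero, negative
        return v;
    } catch (const std::exception &) {
        return std::nullopt;  // std::invalid_argument or std::out_of_range
    }
}
```

In this sketch "64" is accepted, while "0", "-3", "64abc", "nan", "abc", and the empty string are all rejected, which corresponds to leaving the threshold at its bit-exact default.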