mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-01 11:56:57 -04:00
LocalAI enables a cross-request prompt-prefix cache (cache_reuse, see core/config/serving_defaults.go) so repeated prefixes — system prompts, RAG context, agent scaffolds, multi-turn chat — are not reprocessed every turn. For sliding-window-attention (SWA) models (Gemma 2/3, Cohere2, Llama 4, ...) this silently does nothing: llama.cpp defaults to a reduced SWA KV cache sized to the sliding window, and that reduced cache cannot preserve a prompt prefix across requests, so every turn reprocesses the whole prompt anyway. llama.cpp's --swa-full (params.swa_full, already wired through the LocalAI llama.cpp backend's `swa_full` option) keeps the full KV cache so the shared prefix is reused. Enable it automatically, but only for models that are actually SWA: detection reads the gguf-parser-normalized `<arch>.attention.sliding_window` metadata (which also applies llama.cpp's family rules, e.g. Phi-3 → not SWA), right where the GGUF is already parsed for defaults. It is never applied to dense models (pure memory waste) and never overrides an explicit user `swa_full`/`n_swa` choice. Tradeoff: the full SWA cache scales with context_size, so it costs more memory at large contexts — hence the SWA gating and the documented `swa_full:false` opt-out. Assisted-by: Claude:claude-opus-4-8 [Claude Code] golangci-lint Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
57 lines
2.6 KiB
Go
57 lines
2.6 KiB
Go
package config
|
|
|
|
import (
|
|
gguf "github.com/gpustack/gguf-parser-go"
|
|
"github.com/mudler/xlog"
|
|
)
|
|
|
|
// swaCacheOptionNames lists the backend option keys that control the
|
|
// sliding-window-attention KV cache. If the user pinned any of these we leave
|
|
// the SWA cache alone instead of forcing swa_full.
|
|
var swaCacheOptionNames = []string{"swa_full", "n_swa"}
|
|
|
|
// HasSlidingWindowAttention reports whether the parsed GGUF describes a
|
|
// sliding-window-attention (SWA) model — Gemma 2/3, Cohere2, Llama 4 and the
|
|
// like. The gguf-parser library normalizes the per-architecture
|
|
// `<arch>.attention.sliding_window` metadata key into
|
|
// GGUFArchitecture.AttentionSlidingWindow, applying the same family-specific
|
|
// rules llama.cpp uses (e.g. Phi-3 carries the key but does not actually run
|
|
// SWA, and is normalized to 0). A non-zero window means the model interleaves
|
|
// SWA layers, so the returned size is also the diagnostic value we log.
|
|
func HasSlidingWindowAttention(f *gguf.GGUFFile) (uint64, bool) {
|
|
if f == nil {
|
|
return 0, false
|
|
}
|
|
w := f.Architecture().AttentionSlidingWindow
|
|
return w, w > 0
|
|
}
|
|
|
|
// ApplySWAFullDefault enables the full-size SWA KV cache (swa_full:true) for a
|
|
// sliding-window model, unless the user already pinned an SWA cache option.
|
|
//
|
|
// Why: llama.cpp defaults to a reduced SWA KV cache sized to the sliding window
|
|
// (memory-light), but that reduced cache cannot preserve a prompt prefix across
|
|
// requests. So for SWA models the cross-request prefix cache we enable in
|
|
// serving_defaults.go (cache_reuse) is silently defeated — every turn
|
|
// reprocesses the entire prompt. Setting swa_full:true makes llama.cpp keep the
|
|
// full KV cache so the shared prefix is actually reused.
|
|
//
|
|
// The tradeoff is memory: the full SWA cache scales with context_size, so this
|
|
// is gated to models that are genuinely SWA (never applied to dense models,
|
|
// where it would only waste memory) and never overrides an explicit user
|
|
// choice. `slidingWindow` is the value read from the GGUF and is used only for
|
|
// the diagnostic log line.
|
|
func ApplySWAFullDefault(cfg *ModelConfig, slidingWindow uint64) {
|
|
if cfg == nil || slidingWindow == 0 {
|
|
return
|
|
}
|
|
if backendOptionSet(cfg.Options, swaCacheOptionNames...) {
|
|
xlog.Debug("[swa] sliding-window model but an SWA cache option is already set; leaving user choice intact",
|
|
"name", cfg.Name, "sliding_window", slidingWindow)
|
|
return
|
|
}
|
|
cfg.Options = append(cfg.Options, "swa_full:true")
|
|
xlog.Debug("[swa] enabling swa_full for sliding-window model so the cross-request prompt-prefix cache survives (reduced SWA cache cannot reuse a prefix across requests)",
|
|
"name", cfg.Name, "sliding_window", slidingWindow)
|
|
}
|