mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-22 07:39:02 -04:00
* feat(config): enable cross-request prefix caching for serving (Phase 2)

  The llama.cpp backend ships n_cache_reuse=0 (cross-request KV prefix reuse via shifting disabled). Enable it by default (256) so repeated prefixes - system prompts, RAG context, agent scaffolds, multi-turn chat - aren't recomputed. This is the universally-useful part of 'paged attention' (shared-prefix reuse, which the upstream maintainers themselves identify as where paged attn actually helps) and needs none of the block-KV machinery.

  Lives in a serving_defaults.go sibling to hardware_defaults.go (device-driven vs serving-policy defaults); both run from SetDefaults and only fill unset values. Explicit cache_reuse/n_cache_reuse always wins. Device-independent, so it propagates to distributed nodes via the model options with no router change. Shares the backendOptionSet helper with the Phase-1 parallel default.

  Assisted-by: Claude:opus-4.8 [Claude Code]
  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(config): extract generic fallback defaults into ApplyGenericDefaults

  Behavior-preserving: move the inline sampling-param + runtime-flag fallbacks out of SetDefaults into ApplyGenericDefaults, completing the domain-grouped tiers (ApplyInferenceDefaults=family, ApplyHardwareDefaults=device, ApplyServingDefaults=serving, ApplyGenericDefaults=generic fallbacks). SetDefaults is now a clean orchestrator. Same order (runs after the family/hardware/serving tiers so those win) and same conditions (TopK gated on UsesLlamaSamplerDefaults, MMap on XPU). No behavior change; full config suite green.

  (NGPULayers stays in the GGUF-read path for now - it's device-driven but coupled to model-size detection; a separate follow-up.)

  Assisted-by: Claude:opus-4.8 [Claude Code]
  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
57 lines
1.9 KiB
Go
package config

import (
	"fmt"
	"strings"

	"github.com/mudler/xlog"
)

// Serving-policy model-config defaults.
//
// Sibling to hardware_defaults.go: those fill values driven by the target
// *device* (Blackwell batch, VRAM-scaled parallel slots); these fill values
// that improve multi-request / multi-user *serving* regardless of the GPU. They
// run together from SetDefaults and only ever fill values the user left unset.

// DefaultCacheReuse is the minimum shared-prefix chunk (in tokens) the backend
// reuses across requests via KV-cache shifting. The llama.cpp backend ships this
// disabled (n_cache_reuse = 0); we enable it so repeated prefixes (system
// prompts, RAG context, agent scaffolds, multi-turn chat) are not recomputed.
// This is the universally-useful part of "paged attention" (cross-request prefix
// sharing) and needs none of the block-KV machinery.
const DefaultCacheReuse = 256

// ApplyServingDefaults fills serving-policy ModelConfig values the user left
// unset. Currently: enable cross-request prefix caching. Explicit
// cache_reuse/n_cache_reuse in the model options always wins.
func ApplyServingDefaults(cfg *ModelConfig) {
	if cfg == nil {
		return
	}
	if !backendOptionSet(cfg.Options, "cache_reuse", "n_cache_reuse") {
		cfg.Options = append(cfg.Options, fmt.Sprintf("cache_reuse:%d", DefaultCacheReuse))
		xlog.Debug("[serving_defaults] enabling cross-request prefix cache",
			"cache_reuse", DefaultCacheReuse)
	}
}

// backendOptionSet reports whether the backend options already set any of names.
// Options are "name:value" strings (or bare "name"); used so we never override
// an explicit value. Shared with hardware_defaults.go.
func backendOptionSet(opts []string, names ...string) bool {
	for _, o := range opts {
		name := o
		if i := strings.IndexByte(o, ':'); i >= 0 {
			name = o[:i]
		}
		name = strings.TrimSpace(strings.ToLower(name))
		for _, n := range names {
			if name == n {
				return true
			}
		}
	}
	return false
}
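The "fill only if unset" rule above can be exercised in isolation. The following is a minimal standalone sketch, not the package's own test: `modelConfig` is a hypothetical stand-in that models only the `Options` field, the lowercase function names are local copies of the logic shown above, and the `parallel:4` option is purely illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// modelConfig is a hypothetical stand-in for the real ModelConfig; only the
// Options slice that the serving default touches is modeled here.
type modelConfig struct {
	Options []string
}

// defaultCacheReuse mirrors DefaultCacheReuse from the file above.
const defaultCacheReuse = 256

// optionSet is a local copy of backendOptionSet: it compares the lowercased,
// trimmed name before the first ':' against each candidate name.
func optionSet(opts []string, names ...string) bool {
	for _, o := range opts {
		name := o
		if i := strings.IndexByte(o, ':'); i >= 0 {
			name = o[:i]
		}
		name = strings.TrimSpace(strings.ToLower(name))
		for _, n := range names {
			if name == n {
				return true
			}
		}
	}
	return false
}

// applyServingDefaults appends the cache_reuse default only when neither
// spelling of the option is already present (logging omitted in this sketch).
func applyServingDefaults(cfg *modelConfig) {
	if cfg == nil {
		return
	}
	if !optionSet(cfg.Options, "cache_reuse", "n_cache_reuse") {
		cfg.Options = append(cfg.Options, fmt.Sprintf("cache_reuse:%d", defaultCacheReuse))
	}
}

func main() {
	// No cache option present: the default is appended.
	unset := &modelConfig{Options: []string{"parallel:4"}}
	applyServingDefaults(unset)
	fmt.Println(unset.Options) // [parallel:4 cache_reuse:256]

	// Explicit value under the alternate, mixed-case name: left untouched,
	// so a user can still disable reuse with 0.
	explicit := &modelConfig{Options: []string{"N_Cache_Reuse:0"}}
	applyServingDefaults(explicit)
	fmt.Println(explicit.Options) // [N_Cache_Reuse:0]
}
```

Because the check only looks at option names, any explicit value (including 0) suppresses the default, and calling the function twice does not append the option a second time.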