LocalAI/core/config/serving_defaults.go
LocalAI [bot] aef10723c9 feat(config): prefix caching default + consolidate scattered defaults (#10415)
* feat(config): enable cross-request prefix caching for serving (Phase 2)

The llama.cpp backend ships n_cache_reuse=0 (cross-request KV prefix reuse via
shifting disabled). Enable it by default (256) so repeated prefixes - system
prompts, RAG context, agent scaffolds, multi-turn chat - aren't recomputed. This
is the universally-useful part of 'paged attention' (shared-prefix reuse, which
the upstream maintainers themselves identify as where paged attn actually helps)
and needs none of the block-KV machinery.

Lives in a serving_defaults.go sibling to hardware_defaults.go (device-driven vs
serving-policy defaults); both run from SetDefaults and only fill unset values.
Explicit cache_reuse/n_cache_reuse always wins. Device-independent, so it
propagates to distributed nodes via the model options with no router change.
Shares the backendOptionSet helper with the Phase-1 parallel default.
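
The "only fill unset values" rule above can be sketched in isolation. This is a minimal illustration, not the project's code: modelConfig is a hypothetical stand-in for ModelConfig that keeps only the Options slice, optionSet is a compact rewrite of backendOptionSet, and some_option is a made-up option name.

```go
package main

import (
	"fmt"
	"strings"
)

// modelConfig is a hypothetical, stripped-down stand-in for ModelConfig.
// Only the backend options slice matters for this sketch.
type modelConfig struct {
	Options []string
}

// defaultCacheReuse mirrors the value named in the commit message.
const defaultCacheReuse = 256

// optionSet reports whether any "name:value" (or bare "name") entry in opts
// uses one of the given names. Names are expected in lowercase.
func optionSet(opts []string, names ...string) bool {
	for _, o := range opts {
		name, _, _ := strings.Cut(o, ":")
		name = strings.TrimSpace(strings.ToLower(name))
		for _, n := range names {
			if name == n {
				return true
			}
		}
	}
	return false
}

// applyServingDefaults appends the cache_reuse default only when neither
// spelling of the option is already present.
func applyServingDefaults(cfg *modelConfig) {
	if cfg == nil {
		return
	}
	if !optionSet(cfg.Options, "cache_reuse", "n_cache_reuse") {
		cfg.Options = append(cfg.Options, fmt.Sprintf("cache_reuse:%d", defaultCacheReuse))
	}
}

func main() {
	// Unset: the default is appended once; the second call is a no-op.
	unset := &modelConfig{Options: []string{"some_option:1"}}
	applyServingDefaults(unset)
	applyServingDefaults(unset)
	fmt.Println(unset.Options) // [some_option:1 cache_reuse:256]

	// Explicit: the user's value is left untouched, even when it is 0.
	explicit := &modelConfig{Options: []string{"n_cache_reuse:0"}}
	applyServingDefaults(explicit)
	fmt.Println(explicit.Options) // [n_cache_reuse:0]
}
```

Because the check looks at the option name rather than its value, an explicit n_cache_reuse:0 still counts as set, which is what lets a user opt out of the new default.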

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(config): extract generic fallback defaults into ApplyGenericDefaults

Behavior-preserving: move the inline sampling-param + runtime-flag fallbacks out
of SetDefaults into ApplyGenericDefaults, completing the domain-grouped tiers
(ApplyInferenceDefaults=family, ApplyHardwareDefaults=device, ApplyServingDefaults
=serving, ApplyGenericDefaults=generic fallbacks). SetDefaults is now a clean
orchestrator. Same order (runs after the family/hardware/serving tiers so those
win) and same conditions (TopK gated on UsesLlamaSamplerDefaults, MMap on XPU).
No behavior change; full config suite green. (NGPULayers stays in the GGUF-read
path for now - it's device-driven but coupled to model-size detection; a separate
follow-up.)
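
Why running the generic tier last lets the earlier tiers win can be shown with a small sketch. Everything here is assumed for illustration: config, tier, fillTopK, and setDefaults are hypothetical names, the TopK values are arbitrary, and the real Apply* functions are not reproduced.

```go
package main

import "fmt"

// config is a hypothetical stand-in with one pointer field; nil means "unset".
type config struct {
	TopK *int
}

// tier is one defaults pass over the config.
type tier func(*config)

// fillTopK returns a tier that sets TopK only when it is still unset.
func fillTopK(v int) tier {
	return func(c *config) {
		if c.TopK == nil {
			val := v // copy so configs never share a pointer
			c.TopK = &val
		}
	}
}

// setDefaults runs the tiers in order. Since every tier only fills unset
// fields, the first tier that has an opinion about a field wins.
func setDefaults(c *config, tiers ...tier) {
	for _, t := range tiers {
		t(c)
	}
}

func main() {
	family := fillTopK(20)  // illustrative family-specific value
	generic := fillTopK(40) // illustrative generic fallback

	a := &config{}
	setDefaults(a, family, generic)
	fmt.Println(*a.TopK) // 20: the family tier ran first

	b := &config{}
	setDefaults(b, generic)
	fmt.Println(*b.TopK) // 40: only the fallback applied

	user := 5
	c := &config{TopK: &user}
	setDefaults(c, family, generic)
	fmt.Println(*c.TopK) // 5: an explicit user value beats every tier
}
```

With fill-only tiers, precedence is decided purely by call order, so a refactor that moves code between tiers stays behavior-preserving as long as that order is kept.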

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-20 22:44:44 +02:00


package config

import (
	"fmt"
	"strings"

	"github.com/mudler/xlog"
)

// Serving-policy model-config defaults.
//
// Sibling to hardware_defaults.go: those fill values driven by the target
// *device* (Blackwell batch, VRAM-scaled parallel slots); these fill values
// that improve multi-request / multi-user *serving* regardless of the GPU. They
// run together from SetDefaults and only ever fill values the user left unset.

// DefaultCacheReuse is the minimum shared-prefix chunk (in tokens) the backend
// reuses across requests via KV-cache shifting. The llama.cpp backend ships this
// disabled (n_cache_reuse = 0); we enable it so repeated prefixes (system
// prompts, RAG context, agent scaffolds, multi-turn chat) are not recomputed.
// This is the universally-useful part of "paged attention" (cross-request prefix
// sharing) and needs none of the block-KV machinery.
const DefaultCacheReuse = 256

// ApplyServingDefaults fills serving-policy ModelConfig values the user left
// unset. Currently: enable cross-request prefix caching. Explicit
// cache_reuse/n_cache_reuse in the model options always wins.
func ApplyServingDefaults(cfg *ModelConfig) {
	if cfg == nil {
		return
	}
	if !backendOptionSet(cfg.Options, "cache_reuse", "n_cache_reuse") {
		cfg.Options = append(cfg.Options, fmt.Sprintf("cache_reuse:%d", DefaultCacheReuse))
		xlog.Debug("[serving_defaults] enabling cross-request prefix cache",
			"cache_reuse", DefaultCacheReuse)
	}
}

// backendOptionSet reports whether the backend options already set any of names.
// Options are "name:value" strings (or bare "name"); used so we never override
// an explicit value. Shared with hardware_defaults.go.
func backendOptionSet(opts []string, names ...string) bool {
	for _, o := range opts {
		name := o
		if i := strings.IndexByte(o, ':'); i >= 0 {
			name = o[:i]
		}
		name = strings.TrimSpace(strings.ToLower(name))
		for _, n := range names {
			if name == n {
				return true
			}
		}
	}
	return false
}
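
The matching rules of backendOptionSet are easiest to see on a few inputs. The sketch below copies the function body from the file above into a standalone program. The option names cache_reuse_extra and gpu are invented for the test cases and are not claimed to be real backend options.

```go
package main

import (
	"fmt"
	"strings"
)

// backendOptionSet is copied from serving_defaults.go so the example runs
// on its own.
func backendOptionSet(opts []string, names ...string) bool {
	for _, o := range opts {
		name := o
		if i := strings.IndexByte(o, ':'); i >= 0 {
			name = o[:i]
		}
		name = strings.TrimSpace(strings.ToLower(name))
		for _, n := range names {
			if name == n {
				return true
			}
		}
	}
	return false
}

func main() {
	cases := [][]string{
		{"cache_reuse:128"},     // plain name:value
		{"n_cache_reuse"},       // bare name without a value
		{" Cache_Reuse :64"},    // case and surrounding spaces are normalized
		{"cache_reuse_extra:1"}, // made-up name: prefix overlap is not a match
		{"gpu:cache_reuse"},     // made-up name: only the part before ':' is checked
		nil,                     // no options at all
	}
	for _, opts := range cases {
		got := backendOptionSet(opts, "cache_reuse", "n_cache_reuse")
		fmt.Printf("%q -> %v\n", opts, got)
	}
}
```

Only the option side is normalized: the names arguments are compared as given, so callers need to pass them in lowercase, as ApplyServingDefaults does.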