Files
LocalAI/core/config/runtime_settings.go
Leoy b50b1fe418 feat(watchdog): add size-aware LRU eviction mode (#9527)
* feat(watchdog): add size-aware LRU eviction mode

When the model count hits the LRU limit or the memory reclaimer fires,
evict the largest model by on-disk file size first rather than the
least-recently-used one.  For GGUF models the file size is a reliable
proxy for GPU/RAM footprint, so evicting the largest candidate maximises
freed memory per eviction round while keeping small utility models
(embeddings, classifiers, rerankers) resident.

Changes:
- `pkg/model/watchdog.go`: add `sizeAwareEviction` flag and
  `modelSizes map[string]int64` to `WatchDog`; sort candidates by
  `sizeBytes` desc (LRU time as tiebreaker) when the flag is set;
  add `RegisterModelSize`, `SetSizeAwareEviction`, `GetSizeAwareEviction`
- `pkg/model/watchdog_options.go`: add `WithSizeAwareEviction` option
- `pkg/model/initializers.go`: stat model file after load and call
  `RegisterModelSize` so size data is available before the first eviction
- `core/config/application_config.go`, `runtime_settings.go`: add
  `SizeAwareEviction` field and `WithSizeAwareEviction` app option;
  expose via `ToRuntimeSettings` / `ApplyRuntimeSettings` for the
  `POST /api/settings` live-reload path
- `core/cli/run.go`: add `--size-aware-eviction` flag /
  `LOCALAI_SIZE_AWARE_EVICTION` env var
- `core/application/startup.go`, `watchdog.go`: wire the new option
  through to `NewWatchDog`
- `pkg/model/watchdog_test.go`: 5 new specs — option enable, dynamic
  toggle, largest-first ordering, equal-size LRU tiebreaker, no-size
  fallback to LRU, and size-map cleanup on eviction

Closes #9375

Signed-off-by: supermario_leo <leo.stack@outlook.com>

* refactor(watchdog): use vram estimation scaffolding for model size

Replace the brittle os.Stat(modelFile) approach with a proper call to
pkg/vram, which handles multi-file models (DownloadFiles, MMProj) and
all weight file types, not just single GGUF files.

- Add estimateModelSizeBytes() in core/backend/options.go that collects
  all weight file URIs from the model config, resolves them to file://
  URIs, and calls vram.Estimate() with the shared DefaultCachedSizeResolver
  (15-min TTL cache avoids redundant stat calls on repeated loads)
- Thread the result through via a new WithModelSizeBytes() loader option
- In initializers.go, consume the pre-computed size instead of calling
  os.Stat; if no size was supplied (e.g. for external/router-dispatched
  models) the registration is simply skipped

Signed-off-by: supermario_leo <leo.stack@outlook.com>

* refactor(watchdog): use EstimateModel with HF fallback for size estimation

Switch estimateModelSizeBytes from calling vram.Estimate directly to the
unified vram.EstimateModel entry point, which adds automatic fallbacks:
file-based GGUF metadata → HF API → size string.

Also extract the HuggingFace repo ID from model URIs (huggingface://,
hf://, https://huggingface.co/ and org/model short-form) and pass it
as ModelEstimateInput.HFRepo, so models not yet downloaded locally can
still get a size estimate via the HF API.

Addresses @mudler's review feedback: "better to rely on EstimateModel
and pass by the HF URL of the model extracted from the URI".

Signed-off-by: supermario_leo <leo.stack@outlook.com>

* feat(webui): add Size-Aware Eviction toggle to settings page

The size-aware eviction setting was wired through the CLI flag and the
RuntimeSettings live-reload path (POST /api/settings) but had no handle
on the React settings page, so it could not be toggled from the UI.

Add a Size-Aware Eviction toggle to the Watchdog section, next to the
existing Force Eviction When Busy / LRU eviction handles. The settings
page loads and saves the whole RuntimeSettings object, so the new
size_aware_eviction key is picked up with no extra plumbing.

Addresses @mudler's review feedback: the application config setting
should land on the same UI settings page as the other handles.

Signed-off-by: supermario_leo <leo.stack@outlook.com>

---------

Signed-off-by: supermario_leo <leo.stack@outlook.com>
2026-06-21 17:17:04 +02:00

107 lines
6.3 KiB
Go

package config
// RuntimeSettings represents runtime configuration that can be changed dynamically.
// This struct is used for:
// - API responses (GET /api/settings)
// - API requests (POST /api/settings)
// - Persisting to runtime_settings.json
// - Loading from runtime_settings.json on startup
//
// All fields are pointers to distinguish between "not set" and "set to zero/false value".
type RuntimeSettings struct {
// Watchdog settings
WatchdogEnabled *bool `json:"watchdog_enabled,omitempty"`
WatchdogIdleEnabled *bool `json:"watchdog_idle_enabled,omitempty"`
WatchdogBusyEnabled *bool `json:"watchdog_busy_enabled,omitempty"`
WatchdogIdleTimeout *string `json:"watchdog_idle_timeout,omitempty"`
WatchdogBusyTimeout *string `json:"watchdog_busy_timeout,omitempty"`
WatchdogInterval *string `json:"watchdog_interval,omitempty"` // Interval between watchdog checks (e.g., 2s, 30s)
// Backend management
SingleBackend *bool `json:"single_backend,omitempty"` // Deprecated: use MaxActiveBackends = 1 instead
MaxActiveBackends *int `json:"max_active_backends,omitempty"` // Maximum number of active backends (0 = unlimited, 1 = single backend mode)
AutoUpgradeBackends *bool `json:"auto_upgrade_backends,omitempty"` // Automatically upgrade backends when new versions are detected
PreferDevelopmentBackends *bool `json:"prefer_development_backends,omitempty"` // Prefer development backend versions by default in UI
// Memory Reclaimer settings (works with GPU if available, otherwise RAM)
MemoryReclaimerEnabled *bool `json:"memory_reclaimer_enabled,omitempty"` // Enable memory threshold monitoring
MemoryReclaimerThreshold *float64 `json:"memory_reclaimer_threshold,omitempty"` // Threshold 0.0-1.0 (e.g., 0.95 = 95%)
// Eviction settings
ForceEvictionWhenBusy *bool `json:"force_eviction_when_busy,omitempty"` // Force eviction even when models have active API calls (default: false for safety)
SizeAwareEviction *bool `json:"size_aware_eviction,omitempty"` // Evict largest models first rather than least-recently-used (default: false)
LRUEvictionMaxRetries *int `json:"lru_eviction_max_retries,omitempty"` // Maximum number of retries when waiting for busy models to become idle (default: 30)
LRUEvictionRetryInterval *string `json:"lru_eviction_retry_interval,omitempty"` // Interval between retries when waiting for busy models (e.g., 1s, 2s) (default: 1s)
// Performance settings
Threads *int `json:"threads,omitempty"`
ContextSize *int `json:"context_size,omitempty"`
F16 *bool `json:"f16,omitempty"`
Debug *bool `json:"debug,omitempty"`
EnableTracing *bool `json:"enable_tracing,omitempty"`
TracingMaxItems *int `json:"tracing_max_items,omitempty"`
TracingMaxBodyBytes *int `json:"tracing_max_body_bytes,omitempty"` // Per-body cap in bytes; 0 disables the cap
EnableBackendLogging *bool `json:"enable_backend_logging,omitempty"`
// Security/CORS settings
CORS *bool `json:"cors,omitempty"`
CSRF *bool `json:"csrf,omitempty"`
CORSAllowOrigins *string `json:"cors_allow_origins,omitempty"`
// P2P settings
P2PToken *string `json:"p2p_token,omitempty"`
P2PNetworkID *string `json:"p2p_network_id,omitempty"`
Federated *bool `json:"federated,omitempty"`
// Gallery settings
Galleries *[]Gallery `json:"galleries,omitempty"`
BackendGalleries *[]Gallery `json:"backend_galleries,omitempty"`
AutoloadGalleries *bool `json:"autoload_galleries,omitempty"`
AutoloadBackendGalleries *bool `json:"autoload_backend_galleries,omitempty"`
// API keys - No omitempty as we need to save empty arrays to clear keys
ApiKeys *[]string `json:"api_keys"`
// Agent settings
AgentJobRetentionDays *int `json:"agent_job_retention_days,omitempty"`
// Open Responses settings
OpenResponsesStoreTTL *string `json:"open_responses_store_ttl,omitempty"` // TTL for stored responses (e.g., "1h", "30m", "0" = no expiration)
// Agent Pool settings
AgentPoolEnabled *bool `json:"agent_pool_enabled,omitempty"`
AgentPoolDefaultModel *string `json:"agent_pool_default_model,omitempty"`
AgentPoolEmbeddingModel *string `json:"agent_pool_embedding_model,omitempty"`
AgentPoolMaxChunkingSize *int `json:"agent_pool_max_chunking_size,omitempty"`
AgentPoolChunkOverlap *int `json:"agent_pool_chunk_overlap,omitempty"`
AgentPoolEnableLogs *bool `json:"agent_pool_enable_logs,omitempty"`
AgentPoolCollectionDBPath *string `json:"agent_pool_collection_db_path,omitempty"`
AgentPoolVectorEngine *string `json:"agent_pool_vector_engine,omitempty"` // chromem | postgres
AgentPoolDatabaseURL *string `json:"agent_pool_database_url,omitempty"` // PostgreSQL DSN when vector engine is postgres
AgentPoolAgentHubURL *string `json:"agent_pool_agent_hub_url,omitempty"` // override the agenthub.localai.io endpoint
// LocalAI Assistant settings — read live by the chat handler at request
// entry, so flipping the toggle takes effect on the next request.
LocalAIAssistantEnabled *bool `json:"localai_assistant_enabled,omitempty"` // negation of DisableLocalAIAssistant for UI clarity
// Branding / whitelabeling. Text fields are user-facing; *File fields hold
// just the basename of an uploaded asset under {DynamicConfigsDir}/branding/.
// All optional — empty values fall back to bundled LocalAI defaults.
InstanceName *string `json:"instance_name,omitempty"`
InstanceTagline *string `json:"instance_tagline,omitempty"`
LogoFile *string `json:"logo_file,omitempty"`
LogoHorizontalFile *string `json:"logo_horizontal_file,omitempty"`
FaviconFile *string `json:"favicon_file,omitempty"`
// Cloud-proxy MITM listener. MITMCADir is intentionally NOT
// exposed at runtime — the CA dir is a startup-only path and
// changing it after the CA has been generated would orphan
// trusted clients.
MITMListen *string `json:"mitm_listen,omitempty"`
// PIIDefaultDetectors are the token-classification detector models applied
// to any PII-enabled model that names no detectors of its own (so
// cloud-proxy/MITM redaction works without per-model config). No omitempty:
// an empty array must round-trip so the operator can clear it from the UI.
PIIDefaultDetectors *[]string `json:"pii_default_detectors"`
}