* feat(watchdog): add size-aware LRU eviction mode
When the model count hits the LRU limit or the memory reclaimer fires,
evict the largest model by on-disk file size first rather than the
least-recently-used one. For GGUF models the file size is a reliable
proxy for GPU/RAM footprint, so evicting the largest candidate maximises
freed memory per eviction round while keeping small utility models
(embeddings, classifiers, rerankers) resident.
Changes:
- `pkg/model/watchdog.go`: add `sizeAwareEviction` flag and
`modelSizes map[string]int64` to `WatchDog`; sort candidates by
`sizeBytes` desc (LRU time as tiebreaker) when the flag is set;
add `RegisterModelSize`, `SetSizeAwareEviction`, `GetSizeAwareEviction`
- `pkg/model/watchdog_options.go`: add `WithSizeAwareEviction` option
- `pkg/model/initializers.go`: stat model file after load and call
`RegisterModelSize` so size data is available before the first eviction
- `core/config/application_config.go`, `runtime_settings.go`: add
`SizeAwareEviction` field and `WithSizeAwareEviction` app option;
expose via `ToRuntimeSettings` / `ApplyRuntimeSettings` for the
`POST /api/settings` live-reload path
- `core/cli/run.go`: add `--size-aware-eviction` flag /
`LOCALAI_SIZE_AWARE_EVICTION` env var
- `core/application/startup.go`, `watchdog.go`: wire the new option
through to `NewWatchDog`
- `pkg/model/watchdog_test.go`: add 6 new specs (option enable, dynamic
toggle, largest-first ordering, equal-size LRU tiebreaker, no-size
fallback to LRU, and size-map cleanup on eviction)
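The ordering described above can be sketched as follows. This is a minimal illustration, not the code in `pkg/model/watchdog.go`: the `candidate` type and the `evictionOrder` helper are hypothetical names, and treating a missing size as 0 is an assumption about how the no-size fallback behaves.

```go
package main

import (
	"sort"
	"time"
)

// candidate is a hypothetical stand-in for a loaded model tracked by the watchdog.
type candidate struct {
	name      string
	sizeBytes int64     // 0 means no size was registered (assumption)
	lastUsed  time.Time // last time the model served a request
}

// evictionOrder returns model names in the order they would be evicted.
// With sizeAware set, the largest model goes first and the least-recently-used
// model wins ties. Models without a registered size all compare as 0, so among
// themselves they fall back to plain LRU order. Without sizeAware, the order
// is plain LRU.
func evictionOrder(cands []candidate, sizeAware bool) []string {
	sorted := make([]candidate, len(cands))
	copy(sorted, cands) // leave the caller's slice untouched
	sort.SliceStable(sorted, func(i, j int) bool {
		a, b := sorted[i], sorted[j]
		if sizeAware && a.sizeBytes != b.sizeBytes {
			return a.sizeBytes > b.sizeBytes
		}
		return a.lastUsed.Before(b.lastUsed)
	})
	names := make([]string, 0, len(sorted))
	for _, c := range sorted {
		names = append(names, c.name)
	}
	return names
}
```

Sorting a copy keeps the sketch free of side effects; the first element of the result is the model that would be evicted in the next round.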
Closes #9375
Signed-off-by: supermario_leo <leo.stack@outlook.com>
* refactor(watchdog): use vram estimation scaffolding for model size
Replace the brittle os.Stat(modelFile) approach with a proper call to
pkg/vram, which handles multi-file models (DownloadFiles, MMProj) and
all weight file types, not just single GGUF files.
- Add estimateModelSizeBytes() in core/backend/options.go that collects
all weight file URIs from the model config, resolves them to file://
URIs, and calls vram.Estimate() with the shared DefaultCachedSizeResolver
(15-min TTL cache avoids redundant stat calls on repeated loads)
- Thread the result through via a new WithModelSizeBytes() loader option
- In initializers.go, consume the pre-computed size instead of calling
os.Stat; if no size was supplied (e.g. for external/router-dispatched
models) the registration is simply skipped
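A rough sketch of how a pre-computed size might be threaded through a functional option and skipped when absent. Only the names `WithModelSizeBytes` and `RegisterModelSize` come from this commit; the `Options`, `Option`, `sizeRegistry` and `registerIfKnown` definitions below are hypothetical stand-ins, not the actual loader types.

```go
package main

// Options is a hypothetical stand-in for the loader's option struct.
type Options struct {
	modelSizeBytes int64 // 0 means no estimate was supplied
}

// Option mutates Options; this mirrors the usual Go functional-option pattern.
type Option func(*Options)

// WithModelSizeBytes carries a size estimate computed before the load.
func WithModelSizeBytes(n int64) Option {
	return func(o *Options) { o.modelSizeBytes = n }
}

// sizeRegistry is a hypothetical stand-in for the watchdog's size map.
type sizeRegistry struct {
	sizes map[string]int64
}

func newSizeRegistry() *sizeRegistry {
	return &sizeRegistry{sizes: make(map[string]int64)}
}

// RegisterModelSize records the footprint used later for eviction ordering.
func (r *sizeRegistry) RegisterModelSize(modelID string, n int64) {
	r.sizes[modelID] = n
}

// registerIfKnown applies the options and registers the size only when one
// was supplied. It reports whether a registration happened.
func registerIfKnown(r *sizeRegistry, modelID string, opts ...Option) bool {
	var o Options
	for _, opt := range opts {
		opt(&o)
	}
	if o.modelSizeBytes <= 0 {
		return false // e.g. external models: nothing to register
	}
	r.RegisterModelSize(modelID, o.modelSizeBytes)
	return true
}
```

Keeping the "no size" case as a silent skip means callers that cannot estimate a size need no special handling.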
Signed-off-by: supermario_leo <leo.stack@outlook.com>
* refactor(watchdog): use EstimateModel with HF fallback for size estimation
Switch estimateModelSizeBytes from calling vram.Estimate directly to the
unified vram.EstimateModel entry point, which adds automatic fallbacks:
file-based GGUF metadata → HF API → size string.
Also extract the HuggingFace repo ID from model URIs (huggingface://,
hf://, https://huggingface.co/ and org/model short-form) and pass it
as ModelEstimateInput.HFRepo, so models not yet downloaded locally can
still get a size estimate via the HF API.
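The repo-ID extraction could look roughly like the sketch below. `hfRepoFromURI` is a hypothetical helper written for illustration, not the function added in this commit, and the rule that a bare `org/model` string must have exactly two path segments is an assumption.

```go
package main

import "strings"

// hfRepoFromURI returns the "org/model" repo ID embedded in a model URI,
// or "" when none can be recognised. It accepts the four forms listed in
// the commit message: huggingface://, hf://, https://huggingface.co/ and
// the bare org/model short form.
func hfRepoFromURI(uri string) string {
	rest := uri
	matched := false
	for _, prefix := range []string{"huggingface://", "hf://", "https://huggingface.co/"} {
		if strings.HasPrefix(uri, prefix) {
			rest = strings.TrimPrefix(uri, prefix)
			matched = true
			break
		}
	}
	if !matched && strings.Contains(uri, "://") {
		return "" // some other scheme, e.g. file://
	}
	parts := strings.Split(rest, "/")
	if len(parts) < 2 || parts[0] == "" || parts[1] == "" {
		return ""
	}
	if !matched && len(parts) != 2 {
		return "" // short form must be exactly org/model (assumption)
	}
	return parts[0] + "/" + parts[1]
}
```

Note that the short-form branch is a heuristic: a relative path with exactly two segments is indistinguishable from a repo ID, so a real implementation would likely need an extra check.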
Addresses @mudler's review feedback: "better to rely on EstimateModel
and pass by the HF URL of the model extracted from the URI".
Signed-off-by: supermario_leo <leo.stack@outlook.com>
* feat(webui): add Size-Aware Eviction toggle to settings page
The size-aware eviction setting was wired through the CLI flag and the
RuntimeSettings live-reload path (POST /api/settings) but had no control
on the React settings page, so it could not be toggled from the UI.
Add a Size-Aware Eviction toggle to the Watchdog section, next to the
existing Force Eviction When Busy / LRU eviction controls. The settings
page loads and saves the whole RuntimeSettings object, so the new
size_aware_eviction key is picked up with no extra plumbing.
Addresses @mudler's review feedback: the application config setting
should land on the same UI settings page as the other controls.
Signed-off-by: supermario_leo <leo.stack@outlook.com>
---------
Signed-off-by: supermario_leo <leo.stack@outlook.com>
* feat: split remaining backends and drop embedded backends
- Drop silero-vad, huggingface, and stores backends from embedded
binaries
- Refactor Makefile and Dockerfile to avoid building grpc backends
- Drop golang code that was used to embed backends
- Simplify building by using goreleaser
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(gallery): be specific with llama-cpp backend templates
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(docs): update
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(ci): minor fixes
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore: drop all ffmpeg references
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix: run protogen-go
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Always enable p2p mode
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Update goreleaser file
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(stores): do not always load
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fix linting issues
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Simplify
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Mac OS fixup
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(llama-cpp): consistently select fallback
We did not take into consideration the case where the host has the CPU
flagset but the optimized binaries are not actually present in the asset
dir. As a result, models that specified the llama-cpp backend directly
in their config could fail to pick up the fallback binary when the
optimized binaries were missing.
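The intended behaviour can be illustrated with a small sketch: an optimized variant is used only if its file actually exists, otherwise selection falls through to the fallback. `selectBinary`, `fileExists` and the idea of passing variant file names as a list are hypothetical; this is not the selection code touched by the commit.

```go
package main

import (
	"os"
	"path/filepath"
)

// selectBinary walks the preferred variants (already filtered by the host's
// CPU flags) and returns the first one that exists in assetDir. If none is
// present it tries the fallback. The bool is false when nothing was found.
// The exists function is injected so the check can be stubbed out.
func selectBinary(assetDir string, preferred []string, fallback string, exists func(string) bool) (string, bool) {
	names := append(append([]string{}, preferred...), fallback)
	for _, name := range names {
		p := filepath.Join(assetDir, name)
		if exists(p) {
			return p, true
		}
	}
	return "", false
}

// fileExists reports whether p names an existing non-directory file.
func fileExists(p string) bool {
	info, err := os.Stat(p)
	return err == nil && !info.IsDir()
}
```

A caller would pass `fileExists` as the last argument; the point of the sketch is that CPU flag support alone never selects a binary that is missing on disk.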
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore: adjust and simplify selection
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix: move failure recovery to BackendLoader()
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* comments
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* minor fixups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>