Compare commits

...

39 Commits

Author SHA1 Message Date
LocalAI [bot]
38350d363e fix(backends): enable ROCm/HIP GPU offload for ggml audio backends (#10666) (#10667)
qwen3-tts-cpp, omnivoice-cpp, acestep-cpp and vibevoice-cpp shipped
rocm-* variants that silently ran on CPU ([Load] backend: CPU). Two
coupled defects:

- The Makefiles passed -DGGML_HIPBLAS=ON, but the vendored ggml only
  understands -DGGML_HIP=ON (GGML_HIPBLAS was removed upstream), so the
  ggml-hip backend target was never created and no GPU code was built.
- The CMake foreach that links the ggml GPU backends into the module
  listed blas/cuda/metal/vulkan but not hip, so even a built ggml-hip
  would not have been linked and its static backend registration would
  never run.

CUDA users were unaffected because cublas passes the correct GGML_CUDA=ON
and the foreach already links cuda. Mirror the proven llama-cpp hipblas
block (ROCm clang CC/CXX + AMDGPU_TARGETS) and add hip to each foreach.
Upstream picks the best device via ggml_backend_init_best(), so no
runtime flag is needed once HIP is compiled and linked.


Assisted-by: Claude:claude-opus-4-8[1m] [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-07-04 09:08:20 +02:00
LocalAI [bot]
817136c20e chore: ⬆️ Update CrispStrobe/CrispASR to f35185b876fc482fcb2053a81a2697936ed5fcc0 (#10670)
⬆️ Update CrispStrobe/CrispASR

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-07-04 08:17:02 +02:00
LocalAI [bot]
8396ce1388 chore: ⬆️ Update ggml-org/llama.cpp to d4cff114c0084f1fbc9b4c62717eca8fb2ae494a (#10671)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-07-04 08:16:41 +02:00
LocalAI [bot]
348f3c87c0 fix(gpu-libs): bundle hipBLASLt TensileLibrary data so ROCm backends stop falling back (#10660) (#10672) the
The ROCm packager copied rocBLAS kernel data (rocblas/library/*.dat) into the
bundled lib/ dir and run.sh pointed ROCBLAS_TENSILE_LIBPATH at it, but the
parallel hipBLASLt data dir (hipblaslt/library/TensileLibrary_lazy_gfx*.dat)
was never packaged and no HIPBLASLT_TENSILE_LIBPATH was set. The bundled
libhipblaslt.so therefore resolved its per-arch kernel data relative to itself,
found nothing, and silently fell back to slow generic kernels, logging:

    rocblaslt error: Cannot read "TensileLibrary_lazy_gfx1201.dat": No such file or directory
    rocblaslt error: Could not load "TensileLibrary_lazy_gfx1201.dat"

Fix, mirroring the existing rocBLAS handling:
- package-gpu-libs.sh: extract the rocblas data-dir copy into a reusable
  copy_rocm_data_dir helper and call it for both rocblas and hipblaslt.
- llama-cpp/turboquant run.sh: export HIPBLASLT_TENSILE_LIBPATH when the
  bundled hipblaslt/library dir exists.

The helper takes an optional ROCM_BASE_DIRS override so the copy is unit
testable without a real ROCm install; add a regression test that runs
package_rocm_libs against a fabricated ROCm tree and asserts both data dirs
are bundled.

Note: this bundles whatever gfx*.dat the build image's ROCm provides. If a
given arch's tensile data is absent from the shipped ROCm, that arch still
needs a ROCm bump; the packaging gap itself is fixed for every supported arch.


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-07-04 08:14:12 +02:00
LocalAI [bot]
13310905a3 chore: ⬆️ Update ikawrakow/ik_llama.cpp to bbc7de475178dd0535c16ad85f204a2529806c9d (#10669)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-07-03 23:35:41 +02:00
LocalAI [bot]
2cbb3c96b3 fix(gallery): block SSRF in gallery config URL fetch (#10665) (#10673)
POST /models/apply with an empty "id" fetches the attacker-supplied
"url" gallery config directly via http.Client, with no check that the
URL resolves to a public IP. In the default Docker deployment no API key
is configured, so any network-reachable client can coerce LocalAI into
issuing requests to internal services or cloud-metadata endpoints (and
exfiltrate a small slice of the response through the job error message).

Guard the config fetch chokepoints (GetGalleryConfigFromURL and
GetGalleryConfigFromURLWithContext, which back both the /models/apply
worker and gallery installs) with utils.ValidateExternalURL, matching
the protection already applied to the CORS proxy and image/video/audio
download paths. Only plain http(s) URLs are validated; non-network
schemes (huggingface://, github:, oci://, ollama://, file://) resolve to
fixed public services or local files and are left untouched.


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-07-03 21:32:42 +00:00
Ettore Di Giacinto
1152acc167 Revert "feat(config): default swa_full:true for sliding-window-attention models" (#10674)
Revert "feat(config): default swa_full:true for sliding-window-attention mode…"

This reverts commit 02b007a31e.
2026-07-03 22:46:44 +02:00
walcz-de
cc8ee62db0 feat(pii): export PII/audit events as a Prometheus counter (#10641)
The PII EventStore ring buffer is capacity-bound and meant for
recent-audit browsing via /api/pii/events; operators also want a
monotonic, scrape-friendly signal on /metrics — how many
detections/masks/blocks per hour, per origin, and whether the filter
stopped firing after a deploy (silent-failure class).

EventStore.Record is the single choke point every producer already goes
through (request middleware, response scrubbing, MITM proxy
connects/intercepts), so one lazily-initialised counter there covers all
paths without touching any producer:

  localai_pii_events_total{kind, origin, action, direction}

Same lazy otel.Meter pattern as core/services/routing/billing, so the
counter lands on the Prometheus-backed global MeterProvider installed by
the monitoring service. No behaviour change; label cardinality is
bounded (enum-like fields only, no pattern IDs or user IDs).

Assisted-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Signed-off-by: stefanwalcz <stefan.walcz@walcz.de>
2026-07-03 20:36:15 +00:00
LocalAI [bot]
bfd6c09d88 chore(model gallery): 🤖 add 1 new models via gallery agent (#10663)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-07-03 18:02:09 +02:00
Richard Palethorpe
eb32cd9073 feat(realtime): eager blocking pipeline warm-up + /backend/load API (#10662)
Realtime sessions previously lazy-loaded each pipeline sub-model (VAD,
transcription, LLM, TTS) on first use, so every cold session paid a
per-request model-load stall and load errors only surfaced mid-stream.

Warm the whole pipeline eagerly and blockingly at session start
(including the voice-gate speaker-recognition model, which an enforced
gate blocks each utterance on; compaction's summary_model stays lazy
since it only runs off the response path):
- Add backend.PreloadModel / PreloadModelByName as the single load path
  for every modality (no transcription special-case; backend-omitted
  configs are deprecated).
- The realtime session blocks on Model.Warmup and returns a
  model_load_error to the client if any stage fails to load;
  updateSession warms in the background. Opt out per pipeline with
  pipeline.disable_warmup, exposed as a UI toggle via the
  config-metadata registry.

Add a LocalAI-native POST /backend/load (and /v1/backend/load) that
pre-loads a model -- expanding realtime pipelines into their sub-models
-- as the inverse of /backend/shutdown. There is one preload engine
(backend.PreloadStages): the realtime Warmup methods, /backend/load and
the --load-to-memory startup flag all use it, so --load-to-memory now
also expands pipeline models and records load-failure traces. Pipeline
sub-model alias resolution is likewise shared
(ModelConfigLoader.LoadResolvedModelConfig). Surface the endpoint
everywhere an admin manages models:
- MCP admin tool load_model (httpapi + inproc clients, safety/catalog
  prompts, catalog/dispatch tests).
- "Load into memory" action in the React models UI.
- Swagger regenerated; docs moved to the general backend-monitor page
  since it is not realtime-specific.

Fix a Traces UI crash ("json: unsupported value: -Inf"): audio-snippet
RMS/peak now floor at a finite dBFS, and backend-trace data is sanitized
to drop non-finite floats before marshaling. The sanitizer is
copy-on-write -- it runs on every RecordBackendTrace, so containers are
only re-allocated on the paths that actually changed.

Migrate core/http/openresponses_test.go onto the prebuilt mock-backend
the rest of the http suite already uses -- it was the last spec still
pointing at a real HuggingFace model, so it 404'd wherever no vision
backend was built -- and fix its item_reference specs to send the
spec's "id" field instead of "item_id", which the handler never
accepted.

Assisted-by: Claude:claude-opus-4-8 Claude Code

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-07-03 18:00:37 +02:00
alaningtrump
80ec22945a refactor: use the built-in max/min to simplify the code (#10657)
Signed-off-by: alaningtrump <alaningtrump@outlook.com>
2026-07-03 17:59:26 +02:00
LocalAI [bot]
7a3583b52c fix(python-backends): parse tool-call arguments for chat templates and split implicit reasoning blocks (#10658)
Two bugs broke OpenAI-style tool calling on the MLX backend (and any
Python backend sharing backend/python/common), reproduced end-to-end on
LocalAI v4.5.5 with the metal-mlx backend and
mlx-community/Qwen3.5-2B-MLX-8bit.

messages_to_dicts left each tool call's function.arguments as the raw
OpenAI-wire JSON string. HuggingFace chat templates (e.g. Qwen3.5)
iterate arguments as a mapping (.items()), so any request whose history
contained a prior assistant tool_calls message failed with HTTP 500
"Generation failed: Can only get item pairs from a mapping." — breaking
every agent loop on its second turn. Decode the string back into a dict
so the template sees a mapping.

split_reasoning returned ("", text) whenever the opening think tag was
absent. Models like Qwen3.5 open the assistant turn already inside
thinking, so the generated text carries only the closing </think>; the
whole chain-of-thought leaked into content. When the opener is missing
but the closer is present, treat everything before the closer as
reasoning.

Adds platform-independent unit tests under backend/python/common
(stdlib-only, no MLX/venv required, following parent_watch_test.py).

Assisted-by: Claude Code:claude-opus-4-8

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-07-03 12:13:37 +02:00
LocalAI [bot]
715d4ed8e5 chore: ⬆️ Update ggml-org/llama.cpp to fdb1db877c526ec90f668eca1b858da5dba85560 (#10647)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-07-03 00:46:56 +02:00
LocalAI [bot]
9fcc9c0d43 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 87fc8701ff4da81a7d2a91ec0695f95eb3066a47 (#10649)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-07-03 00:46:41 +02:00
LocalAI [bot]
3c67b5b746 chore: ⬆️ Update CrispStrobe/CrispASR to 9a26976a8c8cf5af0afcdd04463cf8ba91e96a54 (#10648)
⬆️ Update CrispStrobe/CrispASR

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-07-03 00:46:25 +02:00
LocalAI [bot]
bea66fd84e chore: ⬆️ Update leejet/stable-diffusion.cpp to 2574f5936571645f784b77623e1f09bad97d948a (#10650)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-07-03 00:46:10 +02:00
LocalAI [bot]
f7a5dfd5ae chore: ⬆️ Update vllm-metal (darwin) to v0.3.0.dev20260701212152 (#10646)
⬆️ Update vllm-project/vllm-metal (darwin)

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-07-03 00:45:36 +02:00
LocalAI [bot]
6bcaf30c14 chore: ⬆️ Update localai-org/privacy-filter.cpp to 735a6c28607ee82afc3a670383f41b55266a3b9a (#10628)
⬆️ Update localai-org/privacy-filter.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-07-03 00:45:17 +02:00
LocalAI [bot]
ef15b4bfda fix(vllm): install ROCm vLLM from the AMD wheel index on Python 3.12 (#10651)
* fix(vllm): install ROCm vLLM from the AMD wheel index on Python 3.12

The rocm-vllm backend crashed at load with "No module named 'vllm'".
requirements-hipblas-after.txt requested a bare `vllm`, which resolves to
the CUDA-only PyPI wheel; that wheel is unusable on an AMD GPU. vLLM's
prebuilt ROCm wheels live on a dedicated index (https://wheels.vllm.ai/rocm/)
and are published only for CPython 3.12, so on the backend's default 3.10
the installer silently falls back to the CUDA wheel.

Add a hipblas branch to backend/python/vllm/install.sh that pins Python to
3.12 and installs vllm from the ROCm wheel index, hiding the bare-`vllm`
after-file so installRequirements installs only the base ROCm
torch/transformers first and does not pull the CUDA wheel.

Fixes #10642

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(vllm): drop the dead hipblas-after requirement and its hide dance

requirements-hipblas-after.txt (a bare `vllm`) is never installed for
hipblas: installRequirements only adds requirements-${BUILD_PROFILE}-after.txt
when BUILD_TYPE != BUILD_PROFILE, and for hipblas they are equal. So the file
was dead and the install.sh hide/restore of it was a no-op. Remove both. The
hipblas branch already installs vllm explicitly from the ROCm wheel index, so
deleting the bare-`vllm` file also removes a latent CUDA-wheel trap should the
installRequirements gap ever be closed.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-07-03 00:44:55 +02:00
LocalAI [bot]
237bce48e8 feat(ui): forking chat - retry any answer, copy, duplicate, branch (#10645) (#10654)
* feat(ui): clone a chat into a new conversation (#10645)

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): retry any assistant answer, not just the last (#10645)

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): copy an entire chat to the clipboard (#10645)

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): branch a new chat from any assistant answer (#10645)

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ui): send truncated history on mid-conversation retry (#10645)

Mid-conversation retry regenerated an answer with the downstream turns
still in the model's context. handleRegenerate truncated the DOM history
via updateChatSettings (a scheduled state update), but the synchronous
sendMessage that followed read the stale, pre-truncation history from its
closure to build the outbound API payload. Thread the intended base
history explicitly through sendMessage's options.baseHistory so the
request body matches the truncated view. Backward compatible: the normal
send path (no baseHistory) is unchanged.

Also guard two minor issues in Chat.jsx: the "Branch from here" button now
renders under !isStreaming to match the retry button, and the duplicate
toast only fires when forkChat returns a chat (not on a null result).

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-07-03 00:04:44 +02:00
LocalAI [bot]
a4e6e01e4d fix(process): give backend workers a parent-death safety net (#10639)
* fix(grpc): self-terminate backend workers when LocalAI dies non-gracefully

Symptom: a backend model-worker subprocess (the per-model gRPC server LocalAI
spawns) can be orphaned and linger — holding VRAM and its listen port — if the
LocalAI process is killed non-gracefully (e.g. a supervisor's graceful-shutdown
grace period elapses and LocalAI is SIGKILLed) before its own teardown runs.

Root cause: LocalAI's graceful teardown (pkg/signals/handler.go installs the
SIGINT/SIGTERM handler; core/cli/run.go registers app.Shutdown ->
ModelLoader.StopAllGRPC -> process.Stop in pkg/model/process.go) only runs when
LocalAI receives a catchable signal and survives long enough to run its
handlers. Backends are spawned via github.com/mudler/go-processmanager v0.1.1,
whose getSysProcAttr() sets Setpgid:true (own process group, so the group can be
signalled) but never PR_SET_PDEATHSIG/Pdeathsig, and exposes no Config field or
option for a caller to inject/extend SysProcAttr. LocalAI fully delegates
spawning to that library (it never builds the exec.Cmd itself), so it cannot set
a kernel parent-death signal at the spawn site. If LocalAI is SIGKILLed, nothing
tells the backend to exit and it is reparented to init.

Fix: add a best-effort, backend-side safety net at the one shared choke point
every out-of-process Go backend routes through — grpc.StartServer / RunServer in
pkg/grpc. On startup it captures getppid() and polls; when the process is
reparented (getppid changes / becomes 1 — the standard POSIX signal the original
parent died) it logs and self-terminates. getppid() reparent detection is
portable (Linux + macOS), unlike Linux-only PR_SET_PDEATHSIG. Toggle via
LOCALAI_BACKEND_PARENT_WATCH (default on; off on Windows) and
LOCALAI_BACKEND_PARENT_WATCH_INTERVAL. This is strictly a backstop alongside the
existing graceful SIGTERM->grace->SIGKILL teardown, which is unchanged.

Scope/limitations: covers Go-based backends (everything using pkg/grpc). The
C++ backends (e.g. llama-cpp) and Python backends do not route through
pkg/grpc and are not covered by this mechanism — they would each need an
equivalent parent-death check (follow-up). The fully general fix is for
go-processmanager to expose SysProcAttr injection so LocalAI can set Pdeathsig
at spawn for every backend regardless of language (suggested upstream follow-up;
out of scope for this LocalAI-only PR).

Test: pkg/grpc/parentwatch_test.go builds a real test -> middle -> grandchild
process tree, lets the middle process exit to orphan the grandchild running the
real watchParentDeath, and asserts it detects the reparent and self-terminates.
Unix-only (build-tagged), runs in CI (Linux).

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(process): extend parent-death backstop to C++ and Python backends

The Go parent-death watcher (pkg/grpc/parentwatch.go, commit 772b435d5)
only protects backends that route through pkg/grpc. C++ and Python
backends don't, so the originally-reported case — the llama.cpp gRPC
worker surviving a non-graceful LocalAI death — was still uncovered.

Extend the same best-effort backstop to both languages, reusing the
exact mechanism and semantics:

- capture getppid() at startup, skip if already orphaned (<=1)
- a background thread polls getppid() and self-exits on reparenting
  (getppid() != orig || == 1), portable across Linux/macOS, no-op on
  Windows
- same env vars: LOCALAI_BACKEND_PARENT_WATCH (default on; falsy
  false/0/no/off disable) and LOCALAI_BACKEND_PARENT_WATCH_INTERVAL
  (default 2s; accepts Go-style durations like 500ms/2s/1m)

C++: implemented in backend/cpp/llama-cpp (the reported, most-used C++
backend) as a dependency-free header parent_watch.h, wired into
grpc-server.cpp's main() and copied at build time via prepare.sh. C++
backends have no shared server scaffolding, so other C++ backends
(ds4, ik-llama-cpp, privacy-filter, ...) are not yet covered and would
each need the same one-line include+call as follow-ups.

Python: implemented once in the shared common/parent_watch.py and armed
from common/grpc_auth.py's get_auth_interceptors() — the single helper
every one of the 35 Python backends invokes while building its gRPC
server — so all Python backends (and future ones) are covered with no
per-backend edits and no duplicated implementation.

Tests (real process-tree reparent detection, mirroring the Go test):
- backend/cpp/llama-cpp/parent_watch_test.cpp (via run-unit-tests.sh)
- backend/python/common/parent_watch_test.py (python -m unittest)

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Sonnet 5 <noreply@anthropic.com>
2026-07-02 19:16:48 +02:00
LocalAI [bot]
6eea3ef2ac fix(backends): make backend install ops idempotent unless forced (#10643)
* fix(backends): make backend install ops idempotent unless forced

POST /backends/apply hardcoded force=true through
LocalBackendManager.InstallBackend, so applying an already-installed
backend re-downloaded and re-extracted the whole artifact every time.
API clients that ensure a backend exists at startup paid a full OCI
image pull on every boot.

Backend install ops now default to non-forced — an installed, runnable
backend short-circuits (the orphaned-meta reinstall path in
InstallBackendFromGallery is preserved) — and reinstall stays available:

- ManagementOp gains a Force field; the local manager passes it through
  instead of hardcoding true.
- /backends/apply accepts an optional "force" boolean in the body.
- The React UI install route keeps forcing, since its button doubles as
  the explicit "Reinstall backend" action.

Distributed installs already behaved this way (workers skip when the
binary exists unless force is set); this aligns single-node behavior.

Assisted-by: Claude:claude-fable-5 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backends): don't force-reinstall LOCALAI_EXTERNAL_BACKENDS on boot

The startup loop for LOCALAI_EXTERNAL_BACKENDS runs
InstallExternalBackend for each listed backend on every boot, and its
gallery-name path hardcoded force=true — so every start re-downloaded
and re-extracted each listed backend's OCI image even when it was
installed and runnable. Supervising apps that list several backends
paid several full OCI pulls per launch.

Give InstallExternalBackend an explicit force parameter (it only
affects the gallery-name fallback; URI installs always write) and pass:

- false from the boot loop and `local-ai backends install` (idempotent
  ensure — `backends upgrade` is the refresh path),
- op.Force from the local manager's external-URI op,
- the request's force on the worker install path and true on its
  upgrade path (behavior unchanged).

Assisted-by: Claude:claude-fable-5 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-07-02 19:16:29 +02:00
LocalAI [bot]
ad97bcbbdd chore(model gallery): 🤖 add 1 new models via gallery agent (#10644)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-07-02 19:16:09 +02:00
walcz-de
9d8ff90941 fix(cloud-proxy): parameter compatibility with newest reasoning models (#10640)
Newest cloud reasoning models reject two parameters the cloud-proxy
backend currently sends:

- Anthropic (claude-opus-4-x) and OpenAI (gpt-5.x) return 400 when
  temperature is present: "'temperature' is deprecated for this model".
  OpenAI-compatible clients typically send only the server-side DEFAULT
  sampling values rather than user intent, so the translators now forward
  neither temperature nor top_p and let the upstream apply its own
  defaults.
- OpenAI gpt-5.x rejects max_tokens ("Unsupported parameter: 'max_tokens'
  ... Use 'max_completion_tokens' instead"). The OpenAI translator now
  serializes the token limit as max_completion_tokens, which current
  chat-completions models accept.

Verified live against claude-opus-4-8, gpt-5.5 and gemini-3.1-pro
(Gemini OpenAI-compat endpoint). Tests updated to the new contract.

Assisted-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Signed-off-by: stefanwalcz <stefan.walcz@walcz.de>
2026-07-02 19:15:43 +02:00
LocalAI [bot]
29001a88c1 fix(distributed): don't let a dead worker pin the model-load advisory lock (#10600)
* fix(distributed): don't let a dead worker pin the model-load advisory lock

In distributed mode a chat request could fail with:

  failed to route model with internal loader: routing model ...:
  loading model ...: advisorylock: acquiring lock <id>:
  ERROR: canceling statement due to lock timeout (SQLSTATE 55P03)

Root cause is two independent defects in the cross-replica model-load path:

1. SmartRouter.Route holds a per-model PostgreSQL advisory lock for the whole
   cold-load sequence, which includes installBackendOnNode -> InstallBackend,
   a NATS request-reply with a 15m deadline (DefaultBackendInstallTimeout) that
   ignored ctx. When the chosen worker died mid-install, the holder sat on the
   lock for up to 15m. The detached loadCtx (WithoutCancel) had no deadline, so
   nothing capped the hold.

2. The acquiring statement, pg_advisory_lock(), is subject to any deployment
   global lock_timeout. A common operator setting (e.g. 10s) aborts the wait
   with SQLSTATE 55P03, so every other replica's request for that model hard
   -errored instead of waiting for the in-progress load and reusing it. For the
   ~15m window the model was effectively unroutable.

Fixes:

- advisorylock.WithLockCtx (postgres): SET lock_timeout = 0 on its dedicated
  connection (RESET before it returns to the pool) so the Go context, not a
  deployment-wide GUC, governs how long we wait. Waiters now block and then
  re-check, reusing the model another replica just loaded.

- SmartRouter: bound the detached loadCtx with a single ModelLoadCeiling so the
  lock is always released in bounded time even if a sub-step wedges. Default is
  the configured backend.install deadline + 10m (staging + LoadModel margin),
  so a legitimately slow load is never cut.

- installBackendOnNode: use singleflight.DoChan + select on ctx.Done() so the
  install wait honors cancellation; the ceiling can then actually free a caller
  pinned behind a dead worker. The shared install still coalesces via
  singleflight.

Reproduced both defects as failing tests first (a real 55P03 against a
testcontainer with a short lock_timeout; a wedged install that blocks Route)
and confirmed green.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(distributed): bound advisory-lock wait instead of disabling lock_timeout

Setting lock_timeout = 0 to override a deployment's short global lock_timeout
meant "wait forever" server-side. Safe for SmartRouter.Route (its loadCtx now
carries the model-load ceiling) but unsafe for the schema-migration callers
that pass context.Background(): a holder whose session never releases would
hang them indefinitely.

Derive the server-side lock_timeout from the caller's context instead: its
remaining budget plus a margin (so the Go context's cancellation still wins
with a clean error and the server bound is only a backstop), or a finite
30m backstop when the context has no deadline. Never zero - "wait forever"
is no longer possible, while a deployment's hostile short lock_timeout is
still overridden so legitimate cross-replica waits don't fail with 55P03.

Added a spec proving a deadline-less waiter gives up at the (shrunk) backstop
rather than hanging.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-07-02 09:52:51 +02:00
LocalAI [bot]
b0bfa0852e chore: ⬆️ Update CrispStrobe/CrispASR to fcbc8718e654995e3bd2d0c98bcb8e55e297d23c (#10634)
⬆️ Update CrispStrobe/CrispASR

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-07-02 09:48:20 +02:00
LocalAI [bot]
39a93e91cf chore: ⬆️ Update vllm-metal (darwin) to v0.3.0.dev20260701132215 (#10633)
⬆️ Update vllm-project/vllm-metal (darwin)

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-07-02 09:48:08 +02:00
LocalAI [bot]
26e0c98967 chore: ⬆️ Update leejet/stable-diffusion.cpp to 3590aa8d626e671a1b1dc84506ea2932a243a480 (#10631)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-07-02 09:47:54 +02:00
LocalAI [bot]
9acca54b25 chore: ⬆️ Update mudler/parakeet.cpp to e8acc6172a94e20a952cf1843decace5d771a94b (#10629)
⬆️ Update mudler/parakeet.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-07-02 09:47:41 +02:00
LocalAI [bot]
2728e6000e chore: ⬆️ Update ikawrakow/ik_llama.cpp to 068b173649f2fd8dc96b35ada5a0b76d8985105d (#10632)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-07-02 09:47:28 +02:00
LocalAI [bot]
006310d746 chore: ⬆️ Update ggml-org/llama.cpp to 4fc4ec5541b243957ae5099edb67372f8f3b550e (#10630)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-07-02 09:47:15 +02:00
LocalAI [bot]
05acdb1778 chore: ⬆️ Update ggml-org/whisper.cpp to 6fc7c33b4c3a2cec83e4b65abd5e96a890480375 (#10635)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-07-02 09:47:01 +02:00
LocalAI [bot]
5e68b5700c chore(model-gallery): ⬆️ update checksum (#10637)
⬆️ Checksum updates in gallery/index.yaml

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-07-02 09:26:32 +02:00
pos-ei-don
7910018249 fix(vllm): non-streaming tool-call regression after #10351 (#10638)
fix(vllm): non-streaming tool-call regression after #10351 (native_streaming is a capability flag, not a state flag)

#10351 introduced native streaming via `parser.extract_tool_calls_streaming`
and gated the post-loop `extract_tool_calls` block on `native_streaming and
not native_streaming_error`. That works for streaming requests, but for
non-streaming requests the same flag is still True (it only means "the
parser can stream", not "we actually streamed"), so the block was skipped
and the `elif` cleared `content = ""` — the tool call was silently lost.

Symptom: non-streaming chat.completions with `tools=[...]` returns
`finish_reason: "stop"` with `content: ""` and no `tool_calls`. Streaming
requests are unaffected.

Fix: gate both branches on `streaming` too, so the extract_tool_calls
block runs for non-streaming requests (and for streaming requests that
fell back to the buffered path).

Reproduction (vLLM 0.24, Qwen3-Coder-Next-NVFP4, qwen3_coder parser):

    curl -s -X POST http://localhost:8080/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -d '{"model":"coder","stream":false,
           "messages":[{"role":"user","content":"7*8 via calc"}],
           "tools":[{"type":"function","function":{"name":"calc",
             "parameters":{"type":"object",
               "properties":{"expression":{"type":"string"}}}}}]}'

Before: finish_reason: "stop", content: "", tool_calls: []
After:  finish_reason: "tool_calls", tool_calls[0].function.name: "calc"

Streaming path re-verified in the same setup: delta.tool_calls arrives
token-by-token, finish_reason: "tool_calls", no raw XML in content.

Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
2026-07-02 09:26:14 +02:00
LocalAI [bot]
1a03712a6f fix(hipblas): symlink amdgpu.ids so ROCm backends find the ASIC ID table (#10627)
* fix(hipblas): symlink amdgpu.ids so ROCm backends find the ASIC ID table

ROCm's bundled libdrm_amdgpu looks up the GPU ASIC ID table at a
hardcoded fallback path, /opt/amdgpu/share/libdrm/amdgpu.ids, which is
only populated by AMD's full amdgpu-install (graphics/DKMS) stack. The
hipblas image is compute-only and doesn't have it, so every model load
logs "No such file or directory" and the GPU can't be identified.
Symlink it to the equivalent file already shipped by Ubuntu's
libdrm-amdgpu1 package.

Fixes #10624

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(hipblas): correct amdgpu.ids source package name in comment

Verified against the real rocm/dev-ubuntu-24.04:7.2.1 image with
hipblas-dev/hipblaslt-dev/rocblas-dev installed: /usr/share/libdrm/amdgpu.ids
is owned by libdrm-common, not libdrm-amdgpu1 as the comment said.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-07-02 09:25:14 +02:00
LocalAI [bot]
703ea32de6 chore: ⬆️ Update vllm-metal (darwin) to v0.3.0.dev20260630095652 (#10616)
⬆️ Update vllm-project/vllm-metal (darwin)

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-07-01 21:56:59 +02:00
LocalAI [bot]
751db06e35 chore: ⬆️ Update CrispStrobe/CrispASR to 8fd9db8fec8cb5e929d23d3267ed5817794feb1a (#10615)
⬆️ Update CrispStrobe/CrispASR

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-07-01 21:56:41 +02:00
LocalAI [bot]
f46c0e9c83 docs: ⬆️ update docs version mudler/LocalAI (#10614)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-07-01 21:56:21 +02:00
LocalAI [bot]
0d8adfc59a chore: ⬆️ Update ggml-org/llama.cpp to 0eca4d490e591d4e93058d07540cf47278a72577 (#10617)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-07-01 09:31:50 +02:00
119 changed files with 3742 additions and 425 deletions

View File

@@ -171,6 +171,17 @@ RUN if [ "${BUILD_TYPE}" = "hipblas" ]; then \
ln -s /opt/rocm-**/lib/llvm/lib/libomp.so /usr/lib/libomp.so \
; fi
# ROCm's bundled libdrm_amdgpu is built with a hardcoded fallback lookup path
# for the ASIC ID table (/opt/amdgpu/share/libdrm/amdgpu.ids), which only exists
# if AMD's full amdgpu graphics/DKMS stack is installed. This compute-only image
# doesn't have it, so hipblas/rocBLAS log "No such file or directory" on every
# model load and can fail to identify the GPU. Point it at the equivalent file
# Ubuntu's libdrm-common package already ships.
RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ -f /usr/share/libdrm/amdgpu.ids ] && [ ! -e /opt/amdgpu/share/libdrm/amdgpu.ids ]; then \
mkdir -p /opt/amdgpu/share/libdrm && \
ln -s /usr/share/libdrm/amdgpu.ids /opt/amdgpu/share/libdrm/amdgpu.ids \
; fi
RUN expr "${BUILD_TYPE}" = intel && echo "intel" > /run/localai/capability || echo "not intel"
# Cuda

View File

@@ -1,5 +1,5 @@
IK_LLAMA_VERSION?=29431b31c89e79c10f8736e8f2742485ba1713d6
IK_LLAMA_VERSION?=bbc7de475178dd0535c16ad85f204a2529806c9d
LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp
CMAKE_ARGS?=

View File

@@ -101,4 +101,13 @@ if(LLAMA_GRPC_BUILD_TESTS)
target_link_libraries(message_content_test PRIVATE ${_LLAMA_COMMON_TARGET})
target_compile_features(message_content_test PRIVATE cxx_std_17)
add_test(NAME message_content_test COMMAND message_content_test)
# Parent-death watcher test (parent_watch.h) — standard library only, but
# needs a threading runtime for std::thread.
find_package(Threads REQUIRED)
add_executable(parent_watch_test parent_watch_test.cpp parent_watch.h)
target_include_directories(parent_watch_test PRIVATE ${CMAKE_CURRENT_SOURCE_DIR})
target_link_libraries(parent_watch_test PRIVATE Threads::Threads)
target_compile_features(parent_watch_test PRIVATE cxx_std_17)
add_test(NAME parent_watch_test COMMAND parent_watch_test)
endif()

View File

@@ -1,5 +1,5 @@
LLAMA_VERSION?=6f4f53f2b7da54fcdbbecaaa734337c337ad6176
LLAMA_VERSION?=d4cff114c0084f1fbc9b4c62717eca8fb2ae494a
LLAMA_REPO?=https://github.com/ggerganov/llama.cpp
CMAKE_ARGS?=

View File

@@ -75,6 +75,8 @@
#include <windows.h>
#endif
#include "parent_watch.h" // best-effort parent-death backstop (see header)
using grpc::Server;
using grpc::ServerBuilder;
@@ -3442,6 +3444,10 @@ int main(int argc, char** argv) {
}
}
// Best-effort backstop: self-terminate if the LocalAI process that spawned
// us dies without cleaning us up (see parent_watch.h).
llama_grpc::start_parent_death_watcher();
server_context ctx_server;
BackendServiceImpl service(ctx_server);

View File

@@ -0,0 +1,179 @@
// Parent-death watcher (best-effort backstop) for the llama.cpp gRPC backend.
//
// LocalAI spawns this backend as a child process and, on a clean shutdown,
// tears it down itself (SIGTERM -> grace -> SIGKILL). That graceful path only
// runs when LocalAI receives a catchable signal and lives long enough to run
// its handlers. If LocalAI is SIGKILLed (e.g. a supervising process's grace
// period elapses first), that teardown never runs and this backend would be
// reparented to init and linger, holding VRAM and its listen port.
//
// The watcher here is a best-effort backstop for exactly that case: it does
// NOT replace the graceful teardown, it only covers the "parent vanished
// without cleaning up" path. It detects reparenting: when the process that
// spawned this backend dies, the kernel reparents us to the nearest sub-reaper
// or to init (PID 1), so getppid() stops matching the value captured at
// startup. This getppid() approach is portable across Linux/macOS (unlike the
// Linux-only PR_SET_PDEATHSIG), which is why it is used here, mirroring the Go
// backends' pkg/grpc/parentwatch.go. It is disabled on Windows, which has no
// equivalent orphan-reparenting semantics.
//
// This header is intentionally dependency-free (C++ standard library only) so
// it can be exercised by a standalone unit test (parent_watch_test.cpp) without
// building the full llama.cpp + gRPC backend.
#ifndef LLAMA_GRPC_PARENT_WATCH_H
#define LLAMA_GRPC_PARENT_WATCH_H
#include <algorithm>
#include <cctype>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <functional>
#include <string>
#include <thread>
#if !defined(_WIN32)
#include <unistd.h> // getppid(2), _exit(2)
#endif
namespace llama_grpc {
// Env var names are shared verbatim with the Go and Python backends for
// consistency across languages.
inline const char *kEnvParentWatch() { return "LOCALAI_BACKEND_PARENT_WATCH"; }
inline const char *kEnvParentWatchInterval() { return "LOCALAI_BACKEND_PARENT_WATCH_INTERVAL"; }
// Default poll interval in milliseconds. Matches the Go side's 2 * time.Second.
inline long parent_watch_default_interval_ms() { return 2000; }
namespace detail {
inline std::string trim_lower(const std::string &in, bool lower) {
size_t a = in.find_first_not_of(" \t\r\n");
size_t b = in.find_last_not_of(" \t\r\n");
if (a == std::string::npos) {
return "";
}
std::string s = in.substr(a, b - a + 1);
if (lower) {
std::transform(s.begin(), s.end(), s.begin(),
[](unsigned char c) { return std::tolower(c); });
}
return s;
}
} // namespace detail
// parent_watch_enabled reports whether the watcher should run. Enabled by
// default; a falsey value ("false"/"0"/"no"/"off", case-insensitive) disables
// it, matching the Go implementation's exact semantics.
inline bool parent_watch_enabled() {
#if defined(_WIN32)
return false;
#else
const char *v = std::getenv(kEnvParentWatch());
if (v == nullptr || v[0] == '\0') {
return true;
}
const std::string s = detail::trim_lower(v, true);
return !(s == "false" || s == "0" || s == "no" || s == "off");
#endif
}
// parent_watch_interval_ms returns the poll interval in milliseconds. Accepts
// Go-style duration strings ("500ms", "2s", "1m") for cross-language parity, or
// a bare number interpreted as seconds. Defaults to
// parent_watch_default_interval_ms().
inline long parent_watch_interval_ms() {
const long def = parent_watch_default_interval_ms();
const char *v = std::getenv(kEnvParentWatchInterval());
if (v == nullptr || v[0] == '\0') {
return def;
}
const std::string s = detail::trim_lower(v, false);
if (s.empty()) {
return def;
}
size_t i = 0;
while (i < s.size() && (std::isdigit((unsigned char)s[i]) || s[i] == '.')) {
i++;
}
if (i == 0) {
return def;
}
double num = 0.0;
try {
num = std::stod(s.substr(0, i));
} catch (...) {
return def;
}
const std::string unit = s.substr(i);
long ms;
if (unit == "ms") {
ms = (long)num;
} else if (unit == "s" || unit.empty()) {
ms = (long)(num * 1000.0);
} else if (unit == "m") {
ms = (long)(num * 60000.0);
} else {
return def; // unrecognized unit
}
return ms > 0 ? ms : def;
}
#if !defined(_WIN32)
// parent_died reports whether this process has been reparented away from the
// parent it had when the watcher started. Reparenting is the standard POSIX
// signal that the original parent (here, the LocalAI process that spawned this
// backend) has exited: the orphan is handed to the nearest sub-reaper or to
// init (PID 1), so getppid() no longer matches the value captured at startup.
inline bool parent_died(pid_t orig_ppid) {
const pid_t ppid = getppid();
return ppid != orig_ppid || ppid == 1;
}
// watch_parent_death polls until parent_died reports the original parent is
// gone, then invokes on_death. It blocks, so run it on its own thread.
inline void watch_parent_death(pid_t orig_ppid, long interval_ms,
const std::function<void()> &on_death) {
for (;;) {
std::this_thread::sleep_for(std::chrono::milliseconds(interval_ms));
if (parent_died(orig_ppid)) {
on_death();
return;
}
}
}
#endif
// start_parent_death_watcher installs the best-effort safety net described in
// the file header on the calling backend process. It is a no-op when disabled,
// on Windows, or when the process is already orphaned at startup
// (getppid() <= 1). This is a backstop alongside — never a replacement for —
// LocalAI's graceful teardown.
inline void start_parent_death_watcher() {
#if !defined(_WIN32)
if (!parent_watch_enabled()) {
return;
}
const pid_t orig_ppid = getppid();
// A parent of 1 (or less) at startup means we were already orphaned (or
// launched directly under init) — there is no original parent to watch for.
if (orig_ppid <= 1) {
return;
}
const long interval_ms = parent_watch_interval_ms();
std::thread([orig_ppid, interval_ms]() {
watch_parent_death(orig_ppid, interval_ms, [orig_ppid]() {
fprintf(stderr,
"backend parent process (pid %d) exited without stopping "
"this backend; self-terminating to avoid orphaning\n",
(int)orig_ppid);
fflush(stderr);
_exit(1);
});
}).detach();
#endif
}
} // namespace llama_grpc
#endif // LLAMA_GRPC_PARENT_WATCH_H

View File

@@ -0,0 +1,197 @@
// Unit tests for the parent-death watcher (parent_watch.h).
//
// Build & run standalone (C++ standard library only, no nlohmann/json needed):
// g++ -std=c++17 -pthread parent_watch_test.cpp -o t && ./t
//
// The core test (TestDetectsReparent) builds a genuine two-level process tree
// (test -> middle -> grandchild), lets the middle process die, and asserts the
// grandchild's watch_parent_death detects the reparenting and self-terminates —
// mirroring the Go test in pkg/grpc/parentwatch_test.go, but with fork(2).
//
// On Windows this file compiles to a no-op success (the watcher is unsupported
// there), matching parent_watch.h's platform gating.
#include <cstdio>
#include <cstdlib>
#include <string>
#include "parent_watch.h"
static int failures = 0;
static void check(bool ok, const std::string &name) {
if (!ok) {
failures++;
fprintf(stderr, "FAIL: %s\n", name.c_str());
} else {
fprintf(stderr, "ok: %s\n", name.c_str());
}
}
// Env-parsing tests are platform-independent and always run.
static void test_env_parsing() {
using namespace llama_grpc;
// Interval: default when unset.
unsetenv("LOCALAI_BACKEND_PARENT_WATCH_INTERVAL");
check(parent_watch_interval_ms() == 2000, "interval default 2000ms");
setenv("LOCALAI_BACKEND_PARENT_WATCH_INTERVAL", "500ms", 1);
check(parent_watch_interval_ms() == 500, "interval 500ms");
setenv("LOCALAI_BACKEND_PARENT_WATCH_INTERVAL", "2s", 1);
check(parent_watch_interval_ms() == 2000, "interval 2s");
setenv("LOCALAI_BACKEND_PARENT_WATCH_INTERVAL", "1m", 1);
check(parent_watch_interval_ms() == 60000, "interval 1m");
setenv("LOCALAI_BACKEND_PARENT_WATCH_INTERVAL", "3", 1); // bare number -> seconds
check(parent_watch_interval_ms() == 3000, "interval bare 3 -> 3000ms");
setenv("LOCALAI_BACKEND_PARENT_WATCH_INTERVAL", "garbage", 1);
check(parent_watch_interval_ms() == 2000, "interval garbage -> default");
unsetenv("LOCALAI_BACKEND_PARENT_WATCH_INTERVAL");
#if !defined(_WIN32)
// Enabled semantics (POSIX only; always false on Windows).
unsetenv("LOCALAI_BACKEND_PARENT_WATCH");
check(parent_watch_enabled(), "enabled by default");
for (const char *falsey : {"false", "0", "no", "off", "OFF", " False "}) {
setenv("LOCALAI_BACKEND_PARENT_WATCH", falsey, 1);
check(!parent_watch_enabled(), std::string("disabled by '") + falsey + "'");
}
setenv("LOCALAI_BACKEND_PARENT_WATCH", "true", 1);
check(parent_watch_enabled(), "enabled by 'true'");
setenv("LOCALAI_BACKEND_PARENT_WATCH", "1", 1);
check(parent_watch_enabled(), "enabled by '1'");
unsetenv("LOCALAI_BACKEND_PARENT_WATCH");
#endif
}
#if !defined(_WIN32)
#include <atomic>
#include <ctime>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>
static bool file_exists(const std::string &p) {
struct stat st;
return ::stat(p.c_str(), &st) == 0;
}
static bool wait_for_file(const std::string &p, int timeout_ms) {
int waited = 0;
while (waited < timeout_ms) {
if (file_exists(p)) {
return true;
}
usleep(20 * 1000);
waited += 20;
}
return false;
}
static void write_file(const std::string &p, const std::string &content) {
FILE *f = fopen(p.c_str(), "w");
if (f) {
fwrite(content.data(), 1, content.size(), f);
fclose(f);
}
}
// Builds test -> middle -> grandchild via fork(2). The grandchild arms the REAL
// watch_parent_death against middle; middle exits, orphaning the grandchild;
// the watcher must detect the reparenting and self-terminate.
static void test_detects_reparent() {
char tmpl[] = "/tmp/parentwatch_test_XXXXXX";
char *dir = mkdtemp(tmpl);
if (dir == nullptr) {
check(false, "mkdtemp");
return;
}
const std::string ready_file = std::string(dir) + "/ready";
const std::string exited_file = std::string(dir) + "/exited";
pid_t middle = fork();
if (middle < 0) {
check(false, "fork middle");
return;
}
if (middle == 0) {
// ---- middle process ----
pid_t grandchild = fork();
if (grandchild < 0) {
_exit(4);
}
if (grandchild == 0) {
// ---- grandchild process ----
pid_t orig_ppid = getppid(); // == middle
std::thread([&]() {
llama_grpc::watch_parent_death(orig_ppid, 50 /*ms*/, [&]() {
write_file(exited_file, "1");
_exit(7);
});
}).detach();
// Safety valve: never linger if something goes wrong.
std::thread([]() {
usleep(30 * 1000 * 1000);
_exit(2);
}).detach();
// Signal readiness only after the watcher captured orig_ppid.
write_file(ready_file, std::to_string(getpid()));
for (;;) {
pause();
}
}
// middle: wait until grandchild is ready, then exit to orphan it.
if (!wait_for_file(ready_file, 10000)) {
_exit(5);
}
_exit(0);
}
// ---- test (top) process ----
int status = 0;
waitpid(middle, &status, 0); // reap middle only; grandchild is orphaned
check(file_exists(ready_file), "grandchild signaled readiness");
bool detected = wait_for_file(exited_file, 10000);
check(detected, "watcher detected parent death and self-terminated");
// Best-effort cleanup: kill the grandchild if it somehow survived.
if (file_exists(ready_file)) {
FILE *f = fopen(ready_file.c_str(), "r");
if (f) {
int pid = 0;
if (fscanf(f, "%d", &pid) == 1 && pid > 1) {
kill(pid, SIGKILL);
}
fclose(f);
}
}
unlink(ready_file.c_str());
unlink(exited_file.c_str());
rmdir(dir);
}
#endif // !_WIN32
int main() {
test_env_parsing();
#if !defined(_WIN32)
test_detects_reparent();
#endif
if (failures == 0) {
fprintf(stderr, "\nAll parent_watch tests passed.\n");
return 0;
}
fprintf(stderr, "\n%d parent_watch test(s) failed.\n", failures);
return 1;
}

View File

@@ -22,6 +22,10 @@ cp -r grpc-server.cpp llama.cpp/tools/grpc-server/
# unit test (compiled only when -DLLAMA_GRPC_BUILD_TESTS=ON).
cp -r message_content.h llama.cpp/tools/grpc-server/
cp -r message_content_test.cpp llama.cpp/tools/grpc-server/
# Parent-death watcher (included by grpc-server.cpp) and its standalone unit
# test (run via backend/cpp/run-unit-tests.sh; also buildable under ctest).
cp -r parent_watch.h llama.cpp/tools/grpc-server/
cp -r parent_watch_test.cpp llama.cpp/tools/grpc-server/
cp -rfv llama.cpp/vendor/nlohmann/json.hpp llama.cpp/tools/grpc-server/
cp -rfv llama.cpp/vendor/cpp-httplib/httplib.h llama.cpp/tools/grpc-server/

View File

@@ -36,6 +36,12 @@ else
if [ -d "$CURDIR/lib/rocblas/library" ]; then
export ROCBLAS_TENSILE_LIBPATH="$CURDIR"/lib/rocblas/library
fi
# Same for hipBLASLt (rocblaslt): the bundled libhipblaslt.so resolves its
# TensileLibrary_lazy_gfx*.dat kernel data relative to itself, so point it at
# the bundled data or it falls back to slow generic kernels (issue #10660).
if [ -d "$CURDIR/lib/hipblaslt/library" ]; then
export HIPBLASLT_TENSILE_LIBPATH="$CURDIR"/lib/hipblaslt/library
fi
fi
# If there is a lib/ld.so, use it

View File

@@ -8,7 +8,7 @@
# Local development: point at a working checkout instead of cloning, e.g.
# make PRIVACY_FILTER_SRC=$HOME/c/privacy-filter.cpp grpc-server
PRIVACY_FILTER_VERSION?=595f59630c69d361b5196f2aba2c71c873d0c13c
PRIVACY_FILTER_VERSION?=735a6c28607ee82afc3a670383f41b55266a3b9a
PRIVACY_FILTER_REPO?=https://github.com/localai-org/privacy-filter.cpp
PRIVACY_FILTER_SRC?=

View File

@@ -54,7 +54,7 @@ for test_src in "${tests[@]}"; do
name="$(basename "$test_src" .cpp)"
bin="$(mktemp -d)/$name"
echo "==> $test_src"
if ! "$CXX" -std=c++17 -Wall -Wextra \
if ! "$CXX" -std=c++17 -Wall -Wextra -pthread \
-I"$JSON_INC" -I"$(dirname "$test_src")" \
"$test_src" -o "$bin"; then
echo "COMPILE FAILED: $test_src" >&2

View File

@@ -34,6 +34,12 @@ else
if [ -d "$CURDIR/lib/rocblas/library" ]; then
export ROCBLAS_TENSILE_LIBPATH="$CURDIR"/lib/rocblas/library
fi
# Same for hipBLASLt (rocblaslt): the bundled libhipblaslt.so resolves its
# TensileLibrary_lazy_gfx*.dat kernel data relative to itself, so point it at
# the bundled data or it falls back to slow generic kernels (issue #10660).
if [ -d "$CURDIR/lib/hipblaslt/library" ]; then
export HIPBLASLT_TENSILE_LIBPATH="$CURDIR"/lib/hipblaslt/library
fi
fi
# If there is a lib/ld.so, use it

View File

@@ -25,7 +25,7 @@ target_include_directories(goacestepcpp PRIVATE ${ACESTEP_DIR}/src ${ACESTEP_DIR
target_include_directories(goacestepcpp SYSTEM PRIVATE ${ACESTEP_DIR}/ggml/include)
# Link GPU backends if available (mirrors link_ggml_backends macro)
foreach(backend blas cuda metal vulkan)
foreach(backend blas cuda hip metal vulkan)
if(TARGET ggml-${backend})
target_link_libraries(goacestepcpp PRIVATE ggml-${backend})
string(TOUPPER ${backend} BACKEND_UPPER)

View File

@@ -24,7 +24,14 @@ else ifeq ($(BUILD_TYPE),openblas)
else ifeq ($(BUILD_TYPE),clblas)
CMAKE_ARGS+=-DGGML_CLBLAST=ON -DCLBlast_DIR=/some/path
else ifeq ($(BUILD_TYPE),hipblas)
CMAKE_ARGS+=-DGGML_HIPBLAS=ON
# This ggml only understands GGML_HIP (GGML_HIPBLAS was removed upstream),
# so passing GGML_HIPBLAS silently produced a CPU-only build (see #10666).
ROCM_HOME ?= /opt/rocm
ROCM_PATH ?= /opt/rocm
export CXX=$(ROCM_HOME)/llvm/bin/clang++
export CC=$(ROCM_HOME)/llvm/bin/clang
AMDGPU_TARGETS ?= gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201
CMAKE_ARGS+=-DGGML_HIP=ON -DAMDGPU_TARGETS=$(AMDGPU_TARGETS)
else ifeq ($(BUILD_TYPE),vulkan)
CMAKE_ARGS+=-DGGML_VULKAN=ON
else ifeq ($(OS),Darwin)

View File

@@ -142,19 +142,12 @@ func buildAnthropicRequest(opts *pb.PredictOptions, cfg *proxyConfig, stream boo
if req.MaxTokens <= 0 {
req.MaxTokens = anthropicDefaultMaxTokens
}
// Newer Anthropic models 400 when both temperature and top_p are
// set ("`temperature` and `top_p` cannot both be specified for
// this model. Please use only one.") even though their docs only
// "recommend" picking one. The OpenAI-compatible chat UI almost
// always sends both with default values, so prefer temperature
// and drop top_p when both are present.
if t := opts.GetTemperature(); t != 0 {
v := float64(t)
req.Temperature = &v
} else if t := opts.GetTopP(); t != 0 {
v := float64(t)
req.TopP = &v
}
// Do not forward temperature/top_p. Newer Anthropic reasoning models reject
// requests that carry temperature ("`temperature` is deprecated for this
// model"), and the OpenAI-compatible clients typically send only the
// server-side DEFAULT sampling values rather than user intent — dropping
// them loses nothing and lets the upstream apply its own defaults.
_ = opts
req.Tools = convertOpenAITools(opts.GetTools())
req.ToolChoice = convertOpenAIToolChoice(opts.GetToolChoice())

View File

@@ -3,7 +3,6 @@ package main
import (
"encoding/json"
"io"
"math"
"net/http"
"net/http/httptest"
"strings"
@@ -75,15 +74,16 @@ func TestPredict_Anthropic_BasicMessages(t *testing.T) {
g.Expect(captured.Messages).To(HaveLen(1))
g.Expect(captured.Messages[0].Role).To(Equal("user"))
g.Expect(captured.MaxTokens).To(Equal(int32(32)))
g.Expect(captured.Temperature).NotTo(BeNil())
g.Expect(*captured.Temperature).To(Equal(0.5))
// Anthropic 400s when both temperature and top_p are set; the
// translator must prefer temperature and drop top_p.
// Newer Anthropic reasoning models reject requests carrying temperature
// ("`temperature` is deprecated for this model"); clients typically send
// only default sampling values, so the translator forwards neither.
g.Expect(captured.Temperature).To(BeNil())
g.Expect(captured.TopP).To(BeNil())
g.Expect(captured.Stream).To(BeFalse())
}
// When only top_p is set, it should be forwarded.
// Sampling parameters are not forwarded at all — the upstream applies its
// own defaults (newest models reject explicit temperature/top_p).
func TestPredict_Anthropic_TopPOnly(t *testing.T) {
g := NewWithT(t)
srv, captured := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
@@ -99,11 +99,7 @@ func TestPredict_Anthropic_TopPOnly(t *testing.T) {
})
g.Expect(err).NotTo(HaveOccurred())
g.Expect(captured.Temperature).To(BeNil())
// PredictOptions.TopP is float32 on the wire; the translator widens
// to float64 so 0.9 round-trips as 0.8999999761581421… — compare
// with a small tolerance rather than exact equality.
g.Expect(captured.TopP).NotTo(BeNil())
g.Expect(math.Abs(*captured.TopP - 0.9)).To(BeNumerically("<=", 1e-6))
g.Expect(captured.TopP).To(BeNil())
}
func TestPredict_Anthropic_DefaultsMaxTokens(t *testing.T) {

View File

@@ -30,7 +30,7 @@ type openAIRequest struct {
Stream bool `json:"stream,omitempty"`
Temperature *float64 `json:"temperature,omitempty"`
TopP *float64 `json:"top_p,omitempty"`
MaxTokens *int32 `json:"max_tokens,omitempty"`
MaxTokens *int32 `json:"max_completion_tokens,omitempty"` // newer OpenAI models reject max_tokens ("use max_completion_tokens instead")
Stop []string `json:"stop,omitempty"`
FrequencyPenalty *float64 `json:"frequency_penalty,omitempty"`
PresencePenalty *float64 `json:"presence_penalty,omitempty"`
@@ -107,14 +107,10 @@ func buildOpenAIRequest(opts *pb.PredictOptions, cfg *proxyConfig, stream bool)
Tools: parseRawJSON(opts.GetTools()),
ToolChoice: parseRawJSON(opts.GetToolChoice()),
}
if t := opts.GetTemperature(); t != 0 {
v := float64(t)
req.Temperature = &v
}
if t := opts.GetTopP(); t != 0 {
v := float64(t)
req.TopP = &v
}
// Do not forward temperature/top_p. Newer OpenAI reasoning models reject
// temperature as deprecated, and clients typically send only default
// sampling values rather than user intent — let the upstream apply its
// own defaults.
if n := opts.GetTokens(); n > 0 {
req.MaxTokens = &n
}

View File

@@ -74,8 +74,9 @@ func TestPredict_OpenAI_BasicChat(t *testing.T) {
g.Expect(captured.Messages).To(HaveLen(2))
g.Expect(captured.Messages[0].Role).To(Equal("system"))
g.Expect(captured.Messages[1].Role).To(Equal("user"))
g.Expect(captured.Temperature).NotTo(BeNil())
g.Expect(*captured.Temperature).To(Equal(0.5))
// Sampling parameters are not forwarded (newest models reject explicit
// temperature); token limit is serialized as max_completion_tokens.
g.Expect(captured.Temperature).To(BeNil())
g.Expect(captured.MaxTokens).NotTo(BeNil())
g.Expect(*captured.MaxTokens).To(Equal(int32(32)))
g.Expect(captured.Stream).To(BeFalse())

View File

@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
# CrispASR version (release tag)
CRISPASR_REPO?=https://github.com/CrispStrobe/CrispASR
CRISPASR_VERSION?=3b93758f9725d400eca82976f895e4cec3f31260
CRISPASR_VERSION?=f35185b876fc482fcb2053a81a2697936ed5fcc0
SO_TARGET?=libgocrispasr.so
CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF

View File

@@ -30,7 +30,7 @@ target_include_directories(gomnivoicecpp PRIVATE ${OMNIVOICE_DIR}/src)
target_include_directories(gomnivoicecpp SYSTEM PRIVATE ${OMNIVOICE_DIR}/ggml/include)
# Link GPU backends if the upstream ggml created them.
foreach(backend blas cuda metal vulkan sycl)
foreach(backend blas cuda hip metal vulkan sycl)
if(TARGET ggml-${backend})
target_link_libraries(gomnivoicecpp PRIVATE ggml-${backend})
if(backend STREQUAL "cuda")

View File

@@ -24,7 +24,14 @@ else ifeq ($(BUILD_TYPE),openblas)
else ifeq ($(BUILD_TYPE),clblas)
CMAKE_ARGS+=-DGGML_CLBLAST=ON -DCLBlast_DIR=/some/path
else ifeq ($(BUILD_TYPE),hipblas)
CMAKE_ARGS+=-DGGML_HIPBLAS=ON
# This ggml only understands GGML_HIP (GGML_HIPBLAS was removed upstream),
# so passing GGML_HIPBLAS silently produced a CPU-only build (see #10666).
ROCM_HOME ?= /opt/rocm
ROCM_PATH ?= /opt/rocm
export CXX=$(ROCM_HOME)/llvm/bin/clang++
export CC=$(ROCM_HOME)/llvm/bin/clang
AMDGPU_TARGETS ?= gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201
CMAKE_ARGS+=-DGGML_HIP=ON -DAMDGPU_TARGETS=$(AMDGPU_TARGETS)
else ifeq ($(BUILD_TYPE),vulkan)
CMAKE_ARGS+=-DGGML_VULKAN=ON
else ifeq ($(OS),Darwin)

View File

@@ -1,6 +1,6 @@
# parakeet-cpp backend Makefile.
#
# Upstream pin lives below as PARAKEET_VERSION?=f469a57270a1cc4554acb15febf60e56619673b9
# Upstream pin lives below as PARAKEET_VERSION?=e8acc6172a94e20a952cf1843decace5d771a94b
# (.github/bump_deps.sh) can find and update it - matches the
# whisper.cpp / ds4 / vibevoice-cpp convention.
#
@@ -15,7 +15,7 @@
# That's what the L0 smoke test uses. The default target below does the
# proper clone-at-pin + cmake build so CI doesn't need a side-checkout.
PARAKEET_VERSION?=f469a57270a1cc4554acb15febf60e56619673b9
PARAKEET_VERSION?=e8acc6172a94e20a952cf1843decace5d771a94b
PARAKEET_REPO?=https://github.com/mudler/parakeet.cpp
GOCMD?=go

View File

@@ -30,7 +30,7 @@ target_include_directories(goqwen3ttscpp PRIVATE ${QWENTTS_DIR}/src)
target_include_directories(goqwen3ttscpp SYSTEM PRIVATE ${QWENTTS_DIR}/ggml/include)
# Link GPU backends if the upstream ggml created them.
foreach(backend blas cuda metal vulkan sycl)
foreach(backend blas cuda hip metal vulkan sycl)
if(TARGET ggml-${backend})
target_link_libraries(goqwen3ttscpp PRIVATE ggml-${backend})
if(backend STREQUAL "cuda")

View File

@@ -24,7 +24,14 @@ else ifeq ($(BUILD_TYPE),openblas)
else ifeq ($(BUILD_TYPE),clblas)
CMAKE_ARGS+=-DGGML_CLBLAST=ON -DCLBlast_DIR=/some/path
else ifeq ($(BUILD_TYPE),hipblas)
CMAKE_ARGS+=-DGGML_HIPBLAS=ON
# This ggml only understands GGML_HIP (GGML_HIPBLAS was removed upstream),
# so passing GGML_HIPBLAS silently produced a CPU-only build (see #10666).
ROCM_HOME ?= /opt/rocm
ROCM_PATH ?= /opt/rocm
export CXX=$(ROCM_HOME)/llvm/bin/clang++
export CC=$(ROCM_HOME)/llvm/bin/clang
AMDGPU_TARGETS ?= gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201
CMAKE_ARGS+=-DGGML_HIP=ON -DAMDGPU_TARGETS=$(AMDGPU_TARGETS)
else ifeq ($(BUILD_TYPE),vulkan)
CMAKE_ARGS+=-DGGML_VULKAN=ON
else ifeq ($(OS),Darwin)

View File

@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
# stablediffusion.cpp (ggml)
STABLEDIFFUSION_GGML_REPO?=https://github.com/leejet/stable-diffusion.cpp
STABLEDIFFUSION_GGML_VERSION?=484baa41e5e006c52dcd4addc38c830b9489745f
STABLEDIFFUSION_GGML_VERSION?=2574f5936571645f784b77623e1f09bad97d948a
CMAKE_ARGS+=-DGGML_MAX_NAME=128

View File

@@ -50,7 +50,7 @@ target_include_directories(govibevoicecpp SYSTEM PRIVATE ${VIBEVOICE_DIR}/third_
# Link GPU backends if available — vibevoice's own CMake already links
# these to the libvibevoice STATIC library, but we re-link them on the
# MODULE so resolved symbols include all backend kernels.
foreach(backend blas cuda metal vulkan)
foreach(backend blas cuda hip metal vulkan)
if(TARGET ggml-${backend})
target_link_libraries(govibevoicecpp PRIVATE ggml-${backend})
string(TOUPPER ${backend} BACKEND_UPPER)

View File

@@ -29,7 +29,14 @@ else ifeq ($(BUILD_TYPE),openblas)
else ifeq ($(BUILD_TYPE),clblas)
CMAKE_ARGS+=-DGGML_CLBLAST=ON -DCLBlast_DIR=/some/path
else ifeq ($(BUILD_TYPE),hipblas)
CMAKE_ARGS+=-DGGML_HIPBLAS=ON -DVIBEVOICE_GGML_HIPBLAS=ON
# This ggml only understands GGML_HIP (GGML_HIPBLAS was removed upstream),
# so passing GGML_HIPBLAS silently produced a CPU-only build (see #10666).
ROCM_HOME ?= /opt/rocm
ROCM_PATH ?= /opt/rocm
export CXX=$(ROCM_HOME)/llvm/bin/clang++
export CC=$(ROCM_HOME)/llvm/bin/clang
AMDGPU_TARGETS ?= gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201
CMAKE_ARGS+=-DGGML_HIP=ON -DAMDGPU_TARGETS=$(AMDGPU_TARGETS)
else ifeq ($(BUILD_TYPE),vulkan)
CMAKE_ARGS+=-DGGML_VULKAN=ON -DVIBEVOICE_GGML_VULKAN=ON
else ifeq ($(OS),Darwin)

View File

@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
# whisper.cpp version
WHISPER_REPO?=https://github.com/ggml-org/whisper.cpp
WHISPER_CPP_VERSION?=0874de3e8e8e48361dba85c7fe6d176f008bf158
WHISPER_CPP_VERSION?=6fc7c33b4c3a2cec83e4b65abd5e96a890480375
SO_TARGET?=libgowhisper.so
CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF

View File

@@ -11,6 +11,8 @@ import os
import grpc
from parent_watch import start_parent_death_watcher
class _AbortHandler(grpc.RpcMethodHandler):
"""A method handler that immediately aborts with UNAUTHENTICATED."""
@@ -70,6 +72,13 @@ def get_auth_interceptors(*, aio: bool = False):
Returns an empty list when LOCALAI_GRPC_AUTH_TOKEN is not set.
"""
# Arm the best-effort parent-death backstop here: this is the single helper
# every LocalAI Python backend invokes exactly once while building its gRPC
# server (mirroring how the Go watcher arms in pkg/grpc's shared serve path).
# start_parent_death_watcher() is idempotent and a no-op when disabled or on
# unsupported platforms — see parent_watch.py.
start_parent_death_watcher()
token = os.environ.get("LOCALAI_GRPC_AUTH_TOKEN", "")
if not token:
return []

View File

@@ -20,7 +20,15 @@ def split_reasoning(text, think_start, think_end):
Returns ``(reasoning_content, remaining_text)``. When ``think_start`` is
empty or not found, returns ``("", text)`` unchanged.
"""
if not think_start or not text or think_start not in text:
if not think_start or not text:
return "", text
if think_start not in text:
# Models like Qwen3.5 open assistant turns already INSIDE thinking, so
# the generated text carries only the closing tag. Everything before it
# is reasoning that would otherwise leak into the content.
if think_end and think_end in text:
head, _, tail = text.partition(think_end)
return head.strip(), tail.strip()
return "", text
pattern = re.compile(
re.escape(think_start) + r"(.*?)" + re.escape(think_end or ""),

View File

@@ -0,0 +1,75 @@
"""Unit tests for the mlx/mlx-vlm shared helpers (mlx_utils.py).
Run standalone (Python standard library only, no backend venv needed):
python3 -m unittest mlx_utils_test
These mirror the server-less helper tests in backend/python/mlx/test.py
(TestSharedHelpers), but live here so they run on any platform: the mlx
test module imports grpc/backend_pb2 at import time and needs the MLX venv,
whereas mlx_utils only needs the standard library.
"""
import types
import unittest
from mlx_utils import parse_tool_calls, split_reasoning
class TestSplitReasoning(unittest.TestCase):
def test_both_tags(self):
r, c = split_reasoning(
"<think>step 1\nstep 2</think>The answer is 42.", "<think>", "</think>"
)
self.assertEqual(r, "step 1\nstep 2")
self.assertEqual(c, "The answer is 42.")
def test_implicit_opener_only_closing_tag(self):
# Qwen3.5 opens the assistant turn already inside thinking, so the
# output carries only the closing tag; everything before it is reasoning.
r, c = split_reasoning(
"The user is asking about the weather.\n</think>\n\nThe weather in Rome is sunny.",
"<think>",
"</think>",
)
self.assertEqual(r, "The user is asking about the weather.")
self.assertEqual(c, "The weather in Rome is sunny.")
def test_no_tags_at_all(self):
r, c = split_reasoning("just text", "<think>", "</think>")
self.assertEqual(r, "")
self.assertEqual(c, "just text")
def test_empty_think_end_and_no_opener_match(self):
# No think_end to anchor on, and the opener is absent → return unchanged.
r, c = split_reasoning("no opener here", "<think>", "")
self.assertEqual(r, "")
self.assertEqual(c, "no opener here")
def test_empty_text(self):
r, c = split_reasoning("", "<think>", "</think>")
self.assertEqual(r, "")
self.assertEqual(c, "")
class TestParseToolCalls(unittest.TestCase):
def test_with_shim(self):
tm = types.SimpleNamespace(
tool_call_start="<tool_call>",
tool_call_end="</tool_call>",
parse_tool_call=lambda body, tools: {
"name": "get_weather",
"arguments": {"location": body.strip()},
},
)
calls, remaining = parse_tool_calls(
"Sure: <tool_call>Paris</tool_call>", tm, tools=None
)
self.assertEqual(len(calls), 1)
self.assertEqual(calls[0]["name"], "get_weather")
self.assertEqual(calls[0]["arguments"], '{"location": "Paris"}')
self.assertEqual(calls[0]["index"], 0)
self.assertNotIn("<tool_call>", remaining)
if __name__ == "__main__":
unittest.main()

View File

@@ -0,0 +1,149 @@
"""Parent-death watcher (best-effort backstop) for LocalAI Python backends.
LocalAI spawns each backend as a child process and, on a clean shutdown, tears
it down itself (SIGTERM -> grace -> SIGKILL). That graceful path only runs when
LocalAI receives a catchable signal and lives long enough to run its handlers.
If LocalAI is SIGKILLed (e.g. a supervising process's grace period elapses
first), that teardown never runs and this backend would be reparented to init
and linger, holding GPU/VRAM and its listen port.
The watcher here is a best-effort backstop for exactly that case: it does NOT
replace the graceful teardown, it only covers the "parent vanished without
cleaning up" path. It detects reparenting: when the process that spawned this
backend dies, the kernel reparents us to the nearest sub-reaper or to init
(PID 1), so os.getppid() stops matching the value captured at startup. This
getppid() approach is portable across Linux/macOS (unlike the Linux-only
PR_SET_PDEATHSIG), which is why it is used here, mirroring the Go backends'
pkg/grpc/parentwatch.go and the C++ backends' parent_watch.h. It is disabled on
Windows, which has no equivalent orphan-reparenting semantics.
Env vars (shared verbatim across the Go, C++ and Python backends):
LOCALAI_BACKEND_PARENT_WATCH enabled by default; a falsey value
("false"/"0"/"no"/"off", case-insensitive)
disables it.
LOCALAI_BACKEND_PARENT_WATCH_INTERVAL poll interval as a Go-style duration
string ("500ms", "2s", "1m") or a bare
number of seconds. Defaults to 2s.
"""
import os
import sys
import threading
ENV_PARENT_WATCH = "LOCALAI_BACKEND_PARENT_WATCH"
ENV_PARENT_WATCH_INTERVAL = "LOCALAI_BACKEND_PARENT_WATCH_INTERVAL"
_DEFAULT_INTERVAL_SECONDS = 2.0
# Guard so repeated calls (e.g. get_auth_interceptors invoked more than once)
# only ever arm a single watcher thread per process.
_started = False
_started_lock = threading.Lock()
def _enabled():
"""Report whether the watcher should run in this process."""
# Windows does not reparent orphans to a well-known init PID, so the
# getppid() heuristic used here doesn't apply there.
if os.name == "nt" or sys.platform.startswith("win"):
return False
val = os.environ.get(ENV_PARENT_WATCH, "").strip().lower()
if val in ("false", "0", "no", "off"):
return False
return True
def _interval_seconds():
"""Return the configured poll interval in seconds, or the default.
Accepts Go-style duration strings ("500ms", "2s", "1m") for cross-language
parity, or a bare number interpreted as seconds.
"""
raw = os.environ.get(ENV_PARENT_WATCH_INTERVAL, "").strip()
if not raw:
return _DEFAULT_INTERVAL_SECONDS
# Split numeric prefix from unit suffix.
i = 0
while i < len(raw) and (raw[i].isdigit() or raw[i] == "." or (i == 0 and raw[i] in "+-")):
i += 1
if i == 0:
return _DEFAULT_INTERVAL_SECONDS
try:
num = float(raw[:i])
except ValueError:
return _DEFAULT_INTERVAL_SECONDS
unit = raw[i:].lower()
if unit == "ms":
seconds = num / 1000.0
elif unit in ("s", ""):
seconds = num
elif unit == "m":
seconds = num * 60.0
else:
return _DEFAULT_INTERVAL_SECONDS
return seconds if seconds > 0 else _DEFAULT_INTERVAL_SECONDS
def _parent_died(orig_ppid):
"""Report whether this process has been reparented away from orig_ppid.
Reparenting is the standard POSIX signal that the original parent (here, the
LocalAI process that spawned this backend) has exited: the orphan is handed
to the nearest sub-reaper or to init (PID 1), so os.getppid() no longer
matches the value captured at startup.
"""
ppid = os.getppid()
return ppid != orig_ppid or ppid == 1
def _watch(orig_ppid, interval, on_death):
"""Poll until _parent_died reports the original parent is gone, then call
on_death. Blocks, so run it on its own (daemon) thread."""
import time
while True:
time.sleep(interval)
if _parent_died(orig_ppid):
on_death()
return
def start_parent_death_watcher():
"""Install the best-effort safety net described in this module's docstring.
No-op when disabled, on Windows, when already orphaned at startup
(os.getppid() <= 1), or if already started. This is a backstop alongside —
never a replacement for — LocalAI's graceful teardown.
"""
global _started
if not _enabled():
return
with _started_lock:
if _started:
return
orig_ppid = os.getppid()
# A parent of 1 (or less) at startup means we were already orphaned (or
# launched directly under init) — there is no original parent to watch.
if orig_ppid <= 1:
return
interval = _interval_seconds()
def on_death():
print(
"backend parent process (pid {}) exited without stopping this "
"backend; self-terminating to avoid orphaning".format(orig_ppid),
file=sys.stderr,
flush=True,
)
# Immediate, non-cleanup exit: this is a shutdown safety net and the
# normal graceful path is already gone.
os._exit(1)
thread = threading.Thread(
target=_watch,
args=(orig_ppid, interval, on_death),
name="parent-death-watcher",
daemon=True,
)
thread.start()
_started = True

View File

@@ -0,0 +1,150 @@
"""Unit tests for the parent-death watcher (parent_watch.py).
Run standalone (Python standard library only, no backend venv needed):
python3 -m unittest parent_watch_test
The core test (test_detects_reparent) builds a genuine two-level process tree
(test -> middle -> grandchild) with os.fork, lets the middle process die, and
asserts the grandchild's parent_watch._watch detects the reparenting and
self-terminates — mirroring the Go test in pkg/grpc/parentwatch_test.go and the
C++ test in backend/cpp/llama-cpp/parent_watch_test.cpp.
"""
import os
import sys
import tempfile
import threading
import time
import unittest
import parent_watch
class TestParentWatchEnvParsing(unittest.TestCase):
def setUp(self):
self._saved = {
k: os.environ.get(k)
for k in (parent_watch.ENV_PARENT_WATCH, parent_watch.ENV_PARENT_WATCH_INTERVAL)
}
for k in self._saved:
os.environ.pop(k, None)
def tearDown(self):
for k, v in self._saved.items():
if v is None:
os.environ.pop(k, None)
else:
os.environ[k] = v
def test_interval_default(self):
self.assertEqual(parent_watch._interval_seconds(), 2.0)
def test_interval_units(self):
cases = {"500ms": 0.5, "2s": 2.0, "1m": 60.0, "3": 3.0, "0.5s": 0.5}
for raw, expected in cases.items():
os.environ[parent_watch.ENV_PARENT_WATCH_INTERVAL] = raw
self.assertAlmostEqual(parent_watch._interval_seconds(), expected, msg=raw)
def test_interval_garbage_falls_back(self):
os.environ[parent_watch.ENV_PARENT_WATCH_INTERVAL] = "garbage"
self.assertEqual(parent_watch._interval_seconds(), 2.0)
@unittest.skipIf(os.name == "nt" or sys.platform.startswith("win"), "POSIX only")
def test_enabled_default(self):
self.assertTrue(parent_watch._enabled())
@unittest.skipIf(os.name == "nt" or sys.platform.startswith("win"), "POSIX only")
def test_disabled_by_falsey(self):
for val in ("false", "0", "no", "off", "OFF", " False "):
os.environ[parent_watch.ENV_PARENT_WATCH] = val
self.assertFalse(parent_watch._enabled(), msg=val)
@unittest.skipIf(os.name == "nt" or sys.platform.startswith("win"), "POSIX only")
def test_enabled_by_truthy(self):
for val in ("true", "1", "yes", "on"):
os.environ[parent_watch.ENV_PARENT_WATCH] = val
self.assertTrue(parent_watch._enabled(), msg=val)
@unittest.skipIf(os.name == "nt" or sys.platform.startswith("win"), "fork/reparent is POSIX only")
class TestParentWatchReparent(unittest.TestCase):
def _wait_for_file(self, path, timeout=10.0):
deadline = time.time() + timeout
while time.time() < deadline:
if os.path.exists(path):
return True
time.sleep(0.02)
return False
def test_detects_reparent(self):
tmpdir = tempfile.mkdtemp(prefix="parentwatch_test_")
ready_file = os.path.join(tmpdir, "ready")
exited_file = os.path.join(tmpdir, "exited")
middle = os.fork()
if middle == 0:
# ---- middle process ----
grandchild = os.fork()
if grandchild == 0:
# ---- grandchild process: arm the REAL watcher against middle ----
orig_ppid = os.getppid()
def on_death():
with open(exited_file, "w") as f:
f.write("1")
os._exit(7)
threading.Thread(
target=parent_watch._watch,
args=(orig_ppid, 0.05, on_death),
daemon=True,
).start()
# Safety valve: never linger if something goes wrong.
def bail():
time.sleep(30)
os._exit(2)
threading.Thread(target=bail, daemon=True).start()
# Signal readiness only after the watcher captured orig_ppid.
with open(ready_file, "w") as f:
f.write(str(os.getpid()))
while True:
time.sleep(1)
else:
# middle: wait until grandchild is ready, then exit to orphan it.
if not self._wait_for_file(ready_file):
os._exit(5)
os._exit(0)
# ---- test (top) process ----
os.waitpid(middle, 0) # reap middle only; grandchild is orphaned
self.assertTrue(os.path.exists(ready_file), "grandchild never signaled readiness")
self.assertTrue(
self._wait_for_file(exited_file),
"watcher did not detect parent death within timeout",
)
# Best-effort cleanup: kill the grandchild if it somehow survived.
try:
with open(ready_file) as f:
pid = int(f.read().strip())
if pid > 1:
os.kill(pid, 9)
except (OSError, ValueError):
pass
for p in (ready_file, exited_file):
try:
os.remove(p)
except OSError:
pass
try:
os.rmdir(tmpdir)
except OSError:
pass
if __name__ == "__main__":
unittest.main()

View File

@@ -58,7 +58,18 @@ def messages_to_dicts(proto_messages):
d["reasoning_content"] = msg.reasoning_content
if msg.tool_calls:
try:
d["tool_calls"] = json.loads(msg.tool_calls)
tool_calls = json.loads(msg.tool_calls)
# Chat templates (e.g. Qwen) iterate function.arguments as a
# mapping, but the OpenAI wire format carries it as a JSON
# string — decode it back so the template's .items() works.
for tc in tool_calls:
fn = tc.get("function") if isinstance(tc, dict) else None
if isinstance(fn, dict) and isinstance(fn.get("arguments"), str):
try:
fn["arguments"] = json.loads(fn["arguments"])
except json.JSONDecodeError:
pass
d["tool_calls"] = tool_calls
except json.JSONDecodeError:
pass
result.append(d)

View File

@@ -0,0 +1,122 @@
"""Unit tests for the shared python backend helpers (python_utils.py).
Run standalone (Python standard library only, no backend venv needed):
python3 -m unittest python_utils_test
These mirror the server-less helper tests in backend/python/mlx/test.py
(TestSharedHelpers), but live here so they run on any platform: the mlx
test module imports grpc/backend_pb2 at import time and needs the MLX venv,
whereas python_utils has no third-party dependency. Proto Message objects
are faked with types.SimpleNamespace (real proto fields default to "").
"""
import json
import types
import unittest
from python_utils import messages_to_dicts, parse_options
def _msg(**fields):
"""Fake a proto Message: every unset field is the empty string, as protobuf."""
defaults = {
"role": "",
"content": "",
"name": "",
"tool_call_id": "",
"reasoning_content": "",
"tool_calls": "",
}
defaults.update(fields)
return types.SimpleNamespace(**defaults)
class TestParseOptions(unittest.TestCase):
def test_type_inference(self):
opts = parse_options(
["temperature:0.7", "max_tokens:128", "trust:true", "name:hello", "no_colon_skipped"]
)
self.assertEqual(opts["temperature"], 0.7)
self.assertEqual(opts["max_tokens"], 128)
self.assertIs(opts["trust"], True)
self.assertEqual(opts["name"], "hello")
self.assertNotIn("no_colon_skipped", opts)
class TestMessagesToDicts(unittest.TestCase):
def test_basic_fields(self):
out = messages_to_dicts(
[
_msg(role="user", content="hi"),
_msg(role="tool", content="42", tool_call_id="call_1", name="f"),
]
)
self.assertEqual(out[0], {"role": "user", "content": "hi"})
self.assertEqual(out[1]["tool_call_id"], "call_1")
self.assertEqual(out[1]["name"], "f")
def test_tool_call_arguments_string_decoded_to_mapping(self):
# OpenAI wire format ships function.arguments as a JSON *string*; chat
# templates iterate it as a mapping, so it must come back as a dict.
out = messages_to_dicts(
[
_msg(
role="assistant",
tool_calls=json.dumps(
[
{
"id": "call_1",
"type": "function",
"function": {
"name": "get_weather",
"arguments": '{"location": "Rome"}',
},
}
]
),
)
]
)
args = out[0]["tool_calls"][0]["function"]["arguments"]
self.assertEqual(args, {"location": "Rome"})
self.assertEqual(dict(args.items()), {"location": "Rome"})
def test_tool_call_arguments_already_mapping_is_idempotent(self):
out = messages_to_dicts(
[
_msg(
role="assistant",
tool_calls=json.dumps(
[{"function": {"name": "f", "arguments": {"a": 1}}}]
),
)
]
)
self.assertEqual(out[0]["tool_calls"][0]["function"]["arguments"], {"a": 1})
def test_tool_call_arguments_invalid_json_left_as_string(self):
out = messages_to_dicts(
[
_msg(
role="assistant",
tool_calls=json.dumps(
[{"function": {"name": "f", "arguments": "not-json"}}]
),
)
]
)
self.assertEqual(out[0]["tool_calls"][0]["function"]["arguments"], "not-json")
def test_tool_call_without_function_key(self):
out = messages_to_dicts(
[_msg(role="assistant", tool_calls=json.dumps([{"id": "call_1"}]))]
)
self.assertEqual(out[0]["tool_calls"], [{"id": "call_1"}])
def test_tool_calls_invalid_json_dropped(self):
out = messages_to_dicts([_msg(role="assistant", tool_calls="{not json")])
self.assertNotIn("tool_calls", out[0])
if __name__ == "__main__":
unittest.main()

View File

@@ -748,7 +748,12 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
# When (A) native streaming ran cleanly, per-delta yields above already
# delivered everything — do NOT extract again on the full text or we'd
# duplicate content/tool_calls into the final chunk.
if has_tool_parser and not (native_streaming and not native_streaming_error):
# NOTE: `native_streaming` is a capability flag ("streaming parser is
# available"), not a state flag ("streaming actually ran"). For
# non-streaming requests it is still True but the per-delta loop was
# never entered, so we MUST still run extract_tool_calls here. Hence
# the explicit `streaming and …` guard on both branches.
if has_tool_parser and not (streaming and native_streaming and not native_streaming_error):
try:
tp = tp_instance
if tp is None:
@@ -770,7 +775,7 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
))
except Exception as e:
print(f"Tool parser error: {e}", file=sys.stderr)
elif native_streaming and not native_streaming_error:
elif streaming and native_streaming and not native_streaming_error:
# Per-delta path already emitted content + tool_calls; the final
# chat_delta should carry only metadata (token counts, logprobs).
content = ""

View File

@@ -35,6 +35,21 @@ if [ "x${BUILD_PROFILE}" == "xcpu" ]; then
EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi
# AMD ROCm: vLLM ships prebuilt ROCm wheels, but on a DEDICATED index
# (https://wheels.vllm.ai/rocm/), NOT PyPI, and ONLY for CPython 3.12. On any
# other Python the installer silently falls back to the CUDA-only PyPI wheel,
# which is unusable on an AMD GPU (import fails, so the backend never finds the
# vllm module). Force Python 3.12 before the venv is created (matches the
# intel/l4t13 cp312 bump); the hipblas branch below pulls vllm from the ROCm
# wheel index. unsafe-best-match lets uv consult that index and PyPI together.
# https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html?device=rocm
if [ "x${BUILD_TYPE}" == "xhipblas" ]; then
PYTHON_VERSION="3.12"
PYTHON_PATCH="12"
PY_STANDALONE_TAG="20251120"
EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi
# cublas13 pulls the vLLM wheel from a per-tag cu130 index (PyPI's vllm wheel
# is built against CUDA 12 and won't load on cu130). uv's default per-package
# first-match strategy would still pick the PyPI wheel, so allow it to consult
@@ -104,7 +119,7 @@ if [ "$(uname -s)" = "Darwin" ]; then
# can rewrite it. Darwin therefore follows vllm-metal and can lag the Linux
# vllm pin (requirements-cublas13-after.txt, bumped independently against
# vllm/vllm) until vllm-metal supports a newer vLLM.
VLLM_METAL_VERSION="v0.3.0.dev20260628073537"
VLLM_METAL_VERSION="v0.3.0.dev20260701212152"
# The coupled vLLM source version is whatever this vllm-metal release builds
# against -- it declares it in its own installer as `vllm_v=`. Derive it from
@@ -194,6 +209,22 @@ elif [ "x${BUILD_TYPE}" == "xintel" ]; then
export CMAKE_PREFIX_PATH="$(python -c 'import site; print(site.getsitepackages()[0])'):${CMAKE_PREFIX_PATH:-}"
VLLM_TARGET_DEVICE=xpu uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} --no-deps .
popd
# AMD ROCm: install vllm from its dedicated ROCm wheel index instead of the
# CUDA-only PyPI wheel. installRequirements brings the base ROCm
# torch/transformers (requirements-hipblas.txt), then we pull vllm (plus the
# matching ROCm torch, via --upgrade) from wheels.vllm.ai/rocm. This is the
# method upstream prescribes for AMD; the Python-3.12 pin is set above.
# There is intentionally no requirements-hipblas-after.txt: a bare `vllm`
# there would resolve to the CUDA wheel, and installRequirements never loads
# a ${BUILD_TYPE}-after file for hipblas anyway (BUILD_TYPE == BUILD_PROFILE).
# https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html?device=rocm
elif [ "x${BUILD_TYPE}" == "xhipblas" ]; then
installRequirements
# --upgrade reconciles the base ROCm torch to whatever the vllm ROCm wheel
# pins; --extra-index-url adds the ROCm wheel repository on top of PyPI.
uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} \
--extra-index-url https://wheels.vllm.ai/rocm/ --upgrade vllm
# FROM_SOURCE=true on a CPU build skips the prebuilt vllm wheel in
# requirements-cpu-after.txt and compiles vllm locally against the host's
# actual CPU. Not used by default because it takes ~30-40 minutes, but

View File

@@ -1 +0,0 @@
vllm

View File

@@ -356,6 +356,12 @@ func initDistributed(cfg *config.ApplicationConfig, authDB *gorm.DB, configLoade
PrefixConfig: prefixCfg,
Pressure: pressure,
SharedModels: cfg.Distributed.SharedModels,
// Cap how long a cold load may hold the per-model advisory lock: the
// configured backend.install deadline plus a margin for file staging and
// the remote LoadModel. Derived from the install timeout so raising it
// (for slow links pulling multi-GB images) widens the ceiling too,
// instead of letting the static default cut a legitimately slow load.
ModelLoadCeiling: cfg.Distributed.BackendInstallTimeoutOrDefault() + 10*time.Minute,
})
// Wire staging-progress broadcasting so file-staging shows up on every

View File

@@ -369,7 +369,7 @@ func New(opts ...config.AppOption) (*Application, error) {
}
for _, backend := range options.ExternalBackends {
if err := galleryop.InstallExternalBackend(options.Context, options.BackendGalleries, options.SystemState, application.ModelLoader(), nil, backend, "", "", options.RequireBackendIntegrity); err != nil {
if err := galleryop.InstallExternalBackend(options.Context, options.BackendGalleries, options.SystemState, application.ModelLoader(), nil, backend, "", "", false, options.RequireBackendIntegrity); err != nil {
xlog.Error("error installing external backend", "error", err)
}
}
@@ -473,20 +473,13 @@ func New(opts ...config.AppOption) (*Application, error) {
if options.LoadToMemory != nil && !options.SingleBackend {
for _, m := range options.LoadToMemory {
cfg, err := application.ModelConfigLoader().LoadModelConfigFileByNameDefaultOptions(m, options)
if err != nil {
xlog.Debug("Auto loading model into memory from file", "model", m)
// Same path as POST /backend/load: a realtime pipeline model expands
// to its sub-models, and load failures are recorded as model_load
// traces.
if _, err := backend.PreloadModelByName(options.Context, application.ModelConfigLoader(), application.ModelLoader(), options, m); err != nil {
return nil, err
}
xlog.Debug("Auto loading model into memory from file", "model", m, "file", cfg.Model)
o := backend.ModelOptions(*cfg, options)
var backendErr error
_, backendErr = application.ModelLoader().Load(o...)
if backendErr != nil {
return nil, backendErr
}
}
}

View File

@@ -52,6 +52,22 @@ func ModelLoadTraceObserver(appConfig *config.ApplicationConfig) func(model.Back
}
}
// PreloadModel warms a model into memory without running any inference, so the
// first real request doesn't pay the backend's cold-start load cost. It uses
// the same ModelOptions + ml.Load path the modality functions use, so a
// subsequent inference call hits the loader cache instead of reloading. Load
// failures are recorded and returned; callers that warm models opportunistically
// (e.g. realtime session warm-up) typically log and continue, since the lazy
// path will retry on first use.
func PreloadModel(ctx context.Context, ml *model.ModelLoader, modelConfig config.ModelConfig, appConfig *config.ApplicationConfig) error {
opts := ModelOptions(modelConfig, appConfig, model.WithContext(ctx))
if _, err := ml.Load(opts...); err != nil {
recordModelLoadFailure(appConfig, modelConfig.Name, modelConfig.Backend, err, nil)
return err
}
return nil
}
// recordModelLoadFailure records a backend trace when model loading fails.
func recordModelLoadFailure(appConfig *config.ApplicationConfig, modelName, backend string, err error, data map[string]any) {
if !appConfig.EnableTracing {

122
core/backend/preload.go Normal file
View File

@@ -0,0 +1,122 @@
package backend
import (
"context"
"errors"
"fmt"
"sync"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/xlog"
)
// PreloadModelByName loads the named model into memory so the first request
// that uses it pays no cold-start load cost — the inverse of shutting a model
// down. If the model is a realtime pipeline (its config declares a `pipeline:`
// block), each configured sub-model (VAD, transcription, LLM, TTS,
// sound_detection, voice_recognition) is loaded concurrently instead of the
// pipeline stub, which has no backend of its own. It returns the model names
// actually loaded and a joined error naming each sub-model that failed (nil on
// full success); a partial pipeline load reports both the loaded names and the
// failures so the caller can surface exactly what is and isn't resident.
// Compaction's summary_model is deliberately left cold: it is only invoked off
// the response path, so it can stay lazy.
func PreloadModelByName(ctx context.Context, cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig, name string) ([]string, error) {
cfg, err := cl.LoadModelConfigFileByNameDefaultOptions(name, appConfig)
if err != nil {
return nil, err
}
stages, err := pipelineStages(cl, &cfg.Pipeline, ml.ModelPath)
if err != nil {
return nil, err
}
if len(stages) == 0 {
// Not a pipeline: load the model's own backend directly.
if err := PreloadModel(ctx, ml, *cfg, appConfig); err != nil {
return nil, err
}
return []string{cfg.Name}, nil
}
return PreloadStages(ctx, ml, appConfig, stages)
}
// PreloadStage names one pipeline sub-model to preload and the resolved config
// to load it from (nil = stage absent, skipped). Role labels the pipeline slot
// in errors and logs.
type PreloadStage struct {
Role string
Cfg *config.ModelConfig
}
// loadStage is PreloadModel behind a seam so PreloadStages can be unit-tested
// without spawning real backends.
var loadStage = PreloadModel
// pipelineStages resolves each populated pipeline stage to its concrete model
// config, following a single alias hop — the same resolution the realtime
// pipeline itself uses. A stage that fails to resolve is a misconfiguration,
// so it fails fast rather than being deferred to load. A pipeline with no
// stages set returns nil, which callers treat as "not a pipeline".
func pipelineStages(cl *config.ModelConfigLoader, p *config.Pipeline, modelPath string) ([]PreloadStage, error) {
voiceRec := ""
if p.VoiceRecognition != nil {
voiceRec = p.VoiceRecognition.Model
}
var stages []PreloadStage
for _, s := range []struct{ role, name string }{
{"vad", p.VAD},
{"transcription", p.Transcription},
{"llm", p.LLM},
{"tts", p.TTS},
{"sound_detection", p.SoundDetection},
{"voice_recognition", voiceRec},
} {
if s.name == "" {
continue
}
cfg, err := cl.LoadResolvedModelConfig(s.name, modelPath)
if err != nil {
return nil, fmt.Errorf("%s (%s): %w", s.role, s.name, err)
}
stages = append(stages, PreloadStage{Role: s.role, Cfg: cfg})
}
return stages, nil
}
// PreloadStages loads every present stage at once and waits for all of them, so
// a pipeline warms in the time of its slowest stage rather than the sum. Absent
// (nil-config) stages are skipped. A failed stage does not cancel the others —
// they all run to completion so the joined error names every broken stage at
// once, alongside the names that did load.
func PreloadStages(ctx context.Context, ml *model.ModelLoader, appConfig *config.ApplicationConfig, stages []PreloadStage) ([]string, error) {
var (
wg sync.WaitGroup
mu sync.Mutex
loaded []string
errs []error
)
for _, s := range stages {
if s.Cfg == nil {
continue
}
wg.Add(1)
go func(s PreloadStage) {
defer wg.Done()
if err := loadStage(ctx, ml, *s.Cfg, appConfig); err != nil {
xlog.Warn("preload: failed to load pipeline sub-model", "stage", s.Role, "model", s.Cfg.Name, "error", err)
mu.Lock()
errs = append(errs, fmt.Errorf("%s (%s): %w", s.Role, s.Cfg.Name, err))
mu.Unlock()
return
}
xlog.Debug("preload: loaded pipeline sub-model", "stage", s.Role, "model", s.Cfg.Name)
mu.Lock()
loaded = append(loaded, s.Cfg.Name)
mu.Unlock()
}(s)
}
wg.Wait()
return loaded, errors.Join(errs...)
}

View File

@@ -0,0 +1,146 @@
package backend
import (
"context"
"errors"
"os"
"path/filepath"
"sync"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/pkg/model"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
var _ = Describe("pipelineStages", func() {
seed := func(dir string, names ...string) *config.ModelConfigLoader {
for _, n := range names {
yaml := "name: " + n + "\nbackend: fake-backend\n"
Expect(os.WriteFile(filepath.Join(dir, n+".yaml"), []byte(yaml), 0o644)).To(Succeed())
}
cl := config.NewModelConfigLoader(dir)
Expect(cl.LoadModelConfigsFromPath(dir)).To(Succeed())
return cl
}
It("resolves only the populated stages, in load order", func() {
dir := GinkgoT().TempDir()
cl := seed(dir, "vad-m", "stt-m", "llm-m", "tts-m")
stages, err := pipelineStages(cl, &config.Pipeline{
VAD: "vad-m",
Transcription: "stt-m",
LLM: "llm-m",
TTS: "tts-m",
}, dir)
Expect(err).ToNot(HaveOccurred())
roles := make([]string, len(stages))
names := make([]string, len(stages))
for i, s := range stages {
roles[i] = s.Role
names[i] = s.Cfg.Name
}
Expect(roles).To(Equal([]string{"vad", "transcription", "llm", "tts"}))
Expect(names).To(Equal([]string{"vad-m", "stt-m", "llm-m", "tts-m"}))
})
It("skips unset stages and includes sound_detection and voice_recognition when set", func() {
dir := GinkgoT().TempDir()
cl := seed(dir, "stt-m", "ced", "spk")
stages, err := pipelineStages(cl, &config.Pipeline{
Transcription: "stt-m",
SoundDetection: "ced",
VoiceRecognition: &config.PipelineVoiceRecognition{Model: "spk"},
}, dir)
Expect(err).ToNot(HaveOccurred())
roles := make([]string, len(stages))
for i, s := range stages {
roles[i] = s.Role
}
Expect(roles).To(ConsistOf("transcription", "sound_detection", "voice_recognition"))
})
It("returns nil for a pipeline with no stages (not a pipeline)", func() {
dir := GinkgoT().TempDir()
cl := seed(dir)
stages, err := pipelineStages(cl, &config.Pipeline{}, dir)
Expect(err).ToNot(HaveOccurred())
Expect(stages).To(BeNil())
})
})
var _ = Describe("PreloadStages", func() {
var (
mu sync.Mutex
seen []string
)
// stubLoader swaps the loadStage seam for a recorder so no real backends
// are spawned; errFor injects per-model failures.
stubLoader := func(errFor map[string]error) {
loadStage = func(_ context.Context, _ *model.ModelLoader, cfg config.ModelConfig, _ *config.ApplicationConfig) error {
mu.Lock()
seen = append(seen, cfg.Name)
mu.Unlock()
return errFor[cfg.Name]
}
}
BeforeEach(func() {
seen = nil
})
AfterEach(func() {
loadStage = PreloadModel
})
mkStage := func(role, name string) PreloadStage {
return PreloadStage{Role: role, Cfg: &config.ModelConfig{Name: name}}
}
It("loads every present stage, skips absent (nil-config) ones, and returns the loaded names", func() {
stubLoader(nil)
loaded, err := PreloadStages(context.Background(), nil, nil, []PreloadStage{
mkStage("vad", "vad-m"),
{Role: "transcription"}, // absent stage
mkStage("llm", "llm-m"),
})
Expect(err).ToNot(HaveOccurred())
Expect(loaded).To(ConsistOf("vad-m", "llm-m"))
// Barrier: every stage has run by the time PreloadStages returns, so
// reading seen without the lock here is safe.
Expect(seen).To(ConsistOf("vad-m", "llm-m"))
})
It("reports a joined error naming each failed stage while still loading the rest", func() {
stubLoader(map[string]error{
"vad-m": errors.New("vad boom"),
"tts-m": errors.New("tts boom"),
})
loaded, err := PreloadStages(context.Background(), nil, nil, []PreloadStage{
mkStage("vad", "vad-m"),
mkStage("llm", "llm-m"),
mkStage("tts", "tts-m"),
})
// Every stage ran (a failure does not cancel the others)...
Expect(seen).To(ConsistOf("vad-m", "llm-m", "tts-m"))
// ...the stage that loaded fine is reported as loaded...
Expect(loaded).To(ConsistOf("llm-m"))
// ...and the joined error names every broken stage and its cause.
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("vad (vad-m)"))
Expect(err.Error()).To(ContainSubstring("vad boom"))
Expect(err.Error()).To(ContainSubstring("tts (tts-m)"))
Expect(err.Error()).To(ContainSubstring("tts boom"))
Expect(err.Error()).ToNot(ContainSubstring("llm"))
})
})

View File

@@ -127,7 +127,7 @@ func (bi *BackendsInstall) Run(ctx *cliContext.Context) error {
}
modelLoader := model.NewModelLoader(systemState)
err = galleryop.InstallExternalBackend(context.Background(), galleries, systemState, modelLoader, progressCallback, bi.BackendArgs, bi.Name, bi.Alias, bi.RequireBackendIntegrity)
err = galleryop.InstallExternalBackend(context.Background(), galleries, systemState, modelLoader, progressCallback, bi.BackendArgs, bi.Name, bi.Alias, false, bi.RequireBackendIntegrity)
if err != nil {
return err
}

View File

@@ -67,16 +67,6 @@ func guessGGUFFromFile(cfg *ModelConfig, f *gguf.GGUFFile, defaultCtx int) {
ApplyMTPDefaults(cfg, n)
}
// Sliding-window-attention models (Gemma 2/3, Cohere2, Llama 4, ...) ship
// with a reduced SWA KV cache by default, which cannot reuse a prompt
// prefix across requests and so defeats the cross-request prefix cache
// (cache_reuse) we enable in serving_defaults.go. Enable the full SWA cache
// for these models so the prefix survives; skipped for dense models and
// when the user already pinned an SWA cache option.
if w, ok := HasSlidingWindowAttention(f); ok {
ApplySWAFullDefault(cfg, w)
}
// Thinking support detection is done after model load via DetectThinkingSupportFromBackend
// template estimations

View File

@@ -599,6 +599,13 @@ func DefaultRegistry() map[string]FieldMetaOverride {
Component: "toggle",
Order: 89,
},
"pipeline.disable_warmup": {
Section: "pipeline",
Label: "Disable Warmup",
Description: "Turn off eager pre-loading of the pipeline's sub-models at realtime session start. By default LocalAI loads every configured sub-model backend (VAD, transcription, LLM, TTS, sound detection, voice recognition) before the session starts and blocks until they are ready, so the first turn pays no cold-start cost and a model that fails to load is reported at session start instead of mid-call. Enable this to restore the lazy 'load on first use' behavior — session start no longer waits on loading and load errors surface on the first turn instead. Useful to keep idle sessions from holding model memory they may never use.",
Component: "toggle",
Order: 90,
},
// --- Functions ---
"function.grammar.parallel_calls": {

View File

@@ -656,6 +656,18 @@ type Pipeline struct {
// to benefit. A client session.update still overrides type and eagerness
// per session; retranscribe is server-side only. Unset keeps server_vad.
TurnDetection PipelineTurnDetection `yaml:"turn_detection,omitempty" json:"turn_detection,omitempty"`
// DisableWarmup turns off eager pre-loading of the pipeline's sub-models at
// realtime session start. By default (false) LocalAI loads every configured
// sub-model backend (VAD, transcription, LLM, TTS, sound detection, voice
// recognition) into memory (concurrently) before the
// session is announced and blocks until they are ready, so the first turn
// pays no cold-start cost and a model that fails to load surfaces as an error
// at session start rather than mid-call. Set true to restore the lazy "load
// on first use" behavior — session start no longer blocks on loading and
// load errors surface on first use instead (e.g. to keep idle sessions from
// holding model memory they may never use).
DisableWarmup bool `yaml:"disable_warmup,omitempty" json:"disable_warmup,omitempty"`
}
// PipelineCompaction configures summarize-then-drop for a realtime pipeline.

View File

@@ -155,6 +155,25 @@ func (bcl *ModelConfigLoader) LoadModelConfigFileByNameDefaultOptions(modelName
ModelPath(appConfig.SystemState.Model.ModelsPath))
}
// LoadResolvedModelConfig loads a model config by name and follows a single
// alias hop, so a caller that references an alias (e.g. a pipeline with
// `llm: default`) gets the alias target's full config (Backend, Model, ...)
// rather than the alias stub with an empty Backend. Without this the alias
// survives unresolved into model loading and fails downstream — notably in
// distributed mode with "backend name is empty". Mirrors the top-level alias
// resolution in core/http/middleware/request.go.
func (bcl *ModelConfigLoader) LoadResolvedModelConfig(modelName, modelPath string) (*ModelConfig, error) {
cfg, err := bcl.LoadModelConfigFileByName(modelName, modelPath)
if err != nil {
return nil, err
}
resolved, _, err := bcl.ResolveAlias(cfg)
if err != nil {
return nil, err
}
return resolved, nil
}
// This format is currently only used when reading a single file at startup, passed in via ApplicationConfig.ConfigFile
func (bcl *ModelConfigLoader) LoadMultipleModelConfigsSingleFile(file string, opts ...ConfigLoaderOption) error {
bcl.Lock()

View File

@@ -1,4 +1,4 @@
package openai
package config_test
import (
"os"
@@ -10,14 +10,14 @@ import (
"github.com/mudler/LocalAI/core/config"
)
// loadPipelineSubModel must resolve a pipeline sub-model that references an
// alias (e.g. `llm: default`) one hop to the alias target's full config — so
// the effective backend is the target's backend, not the empty backend of the
// alias stub. This mirrors the top-level alias resolution done in
// core/http/middleware/request.go, which the realtime pipeline previously
// LoadResolvedModelConfig must resolve a model that references an alias
// (e.g. a pipeline with `llm: default`) one hop to the alias target's full
// config — so the effective backend is the target's backend, not the empty
// backend of the alias stub. This mirrors the top-level alias resolution done
// in core/http/middleware/request.go, which the realtime pipeline previously
// skipped (failing in distributed mode with "backend name is empty").
var _ = Describe("loadPipelineSubModel", func() {
It("resolves a sub-model alias one hop to the target's config", func() {
var _ = Describe("LoadResolvedModelConfig", func() {
It("resolves an alias one hop to the target's config", func() {
tmpDir := GinkgoT().TempDir()
// A real model config with a concrete backend.
@@ -38,13 +38,13 @@ alias: real-llm
Expect(cl.LoadModelConfigsFromPath(tmpDir)).To(Succeed())
// Resolving the alias must follow the hop to the target's full config.
resolved, err := loadPipelineSubModel(cl, "default", tmpDir)
resolved, err := cl.LoadResolvedModelConfig("default", tmpDir)
Expect(err).NotTo(HaveOccurred())
Expect(resolved.IsAlias()).To(BeFalse())
Expect(resolved.Backend).To(Equal("llama-cpp"))
// A non-alias name must load unchanged.
direct, err := loadPipelineSubModel(cl, "real-llm", tmpDir)
direct, err := cl.LoadResolvedModelConfig("real-llm", tmpDir)
Expect(err).NotTo(HaveOccurred())
Expect(direct.Backend).To(Equal("llama-cpp"))
Expect(direct.Name).To(Equal("real-llm"))

View File

@@ -1,56 +0,0 @@
package config
import (
gguf "github.com/gpustack/gguf-parser-go"
"github.com/mudler/xlog"
)
// swaCacheOptionNames lists the backend option keys that control the
// sliding-window-attention KV cache. If the user pinned any of these we leave
// the SWA cache alone instead of forcing swa_full.
var swaCacheOptionNames = []string{"swa_full", "n_swa"}
// HasSlidingWindowAttention reports whether the parsed GGUF describes a
// sliding-window-attention (SWA) model — Gemma 2/3, Cohere2, Llama 4 and the
// like. The gguf-parser library normalizes the per-architecture
// `<arch>.attention.sliding_window` metadata key into
// GGUFArchitecture.AttentionSlidingWindow, applying the same family-specific
// rules llama.cpp uses (e.g. Phi-3 carries the key but does not actually run
// SWA, and is normalized to 0). A non-zero window means the model interleaves
// SWA layers, so the returned size is also the diagnostic value we log.
func HasSlidingWindowAttention(f *gguf.GGUFFile) (uint64, bool) {
if f == nil {
return 0, false
}
w := f.Architecture().AttentionSlidingWindow
return w, w > 0
}
// ApplySWAFullDefault enables the full-size SWA KV cache (swa_full:true) for a
// sliding-window model, unless the user already pinned an SWA cache option.
//
// Why: llama.cpp defaults to a reduced SWA KV cache sized to the sliding window
// (memory-light), but that reduced cache cannot preserve a prompt prefix across
// requests. So for SWA models the cross-request prefix cache we enable in
// serving_defaults.go (cache_reuse) is silently defeated — every turn
// reprocesses the entire prompt. Setting swa_full:true makes llama.cpp keep the
// full KV cache so the shared prefix is actually reused.
//
// The tradeoff is memory: the full SWA cache scales with context_size, so this
// is gated to models that are genuinely SWA (never applied to dense models,
// where it would only waste memory) and never overrides an explicit user
// choice. `slidingWindow` is the value read from the GGUF and is used only for
// the diagnostic log line.
func ApplySWAFullDefault(cfg *ModelConfig, slidingWindow uint64) {
if cfg == nil || slidingWindow == 0 {
return
}
if backendOptionSet(cfg.Options, swaCacheOptionNames...) {
xlog.Debug("[swa] sliding-window model but an SWA cache option is already set; leaving user choice intact",
"name", cfg.Name, "sliding_window", slidingWindow)
return
}
cfg.Options = append(cfg.Options, "swa_full:true")
xlog.Debug("[swa] enabling swa_full for sliding-window model so the cross-request prompt-prefix cache survives (reduced SWA cache cannot reuse a prefix across requests)",
"name", cfg.Name, "sliding_window", slidingWindow)
}

View File

@@ -1,120 +0,0 @@
package config_test
import (
. "github.com/mudler/LocalAI/core/config"
gguf "github.com/gpustack/gguf-parser-go"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
// ggufWithSlidingWindow fabricates a minimal in-memory GGUF carrying the given
// `general.architecture` and `<arch>.attention.sliding_window` so the SWA
// detection can be exercised without a real model file. A window of 0 omits the
// key, modelling a dense (non-SWA) model.
func ggufWithSlidingWindow(arch string, window uint32) *gguf.GGUFFile {
kvs := gguf.GGUFMetadataKVs{
{
Key: "general.architecture",
ValueType: gguf.GGUFMetadataValueTypeString,
Value: arch,
},
}
if window > 0 {
kvs = append(kvs, gguf.GGUFMetadataKV{
Key: arch + ".attention.sliding_window",
ValueType: gguf.GGUFMetadataValueTypeUint32,
Value: window,
})
}
return &gguf.GGUFFile{
Header: gguf.GGUFHeader{MetadataKV: kvs},
}
}
var _ = Describe("SWA full-cache auto-default", func() {
Context("HasSlidingWindowAttention", func() {
It("returns false on a nil GGUF file", func() {
w, ok := HasSlidingWindowAttention(nil)
Expect(ok).To(BeFalse())
Expect(w).To(BeZero())
})
It("detects a sliding-window model (Gemma 3 style)", func() {
w, ok := HasSlidingWindowAttention(ggufWithSlidingWindow("gemma3", 1024))
Expect(ok).To(BeTrue())
Expect(w).To(Equal(uint64(1024)))
})
It("detects Gemma 2 even without an explicit key (family default window)", func() {
// gguf-parser applies llama.cpp's family rules: gemma2 defaults the
// sliding window to 4096 when the metadata key is absent.
w, ok := HasSlidingWindowAttention(ggufWithSlidingWindow("gemma2", 0))
Expect(ok).To(BeTrue())
Expect(w).To(Equal(uint64(4096)))
})
It("reports a dense model as non-SWA", func() {
w, ok := HasSlidingWindowAttention(ggufWithSlidingWindow("llama", 0))
Expect(ok).To(BeFalse())
Expect(w).To(BeZero())
})
It("treats Phi-3 as non-SWA even when the key is present", func() {
// Phi-3 carries attention.sliding_window but does not actually run
// SWA; gguf-parser normalizes it to 0 to match llama.cpp.
w, ok := HasSlidingWindowAttention(ggufWithSlidingWindow("phi3", 2048))
Expect(ok).To(BeFalse())
Expect(w).To(BeZero())
})
})
Context("ApplySWAFullDefault", func() {
It("enables swa_full for a sliding-window model when unset", func() {
cfg := &ModelConfig{Name: "gemma3"}
ApplySWAFullDefault(cfg, 1024)
Expect(cfg.Options).To(ContainElement("swa_full:true"))
})
It("is a no-op for a dense model (window 0)", func() {
cfg := &ModelConfig{Name: "llama"}
ApplySWAFullDefault(cfg, 0)
Expect(cfg.Options).To(BeEmpty())
})
It("preserves an explicit swa_full:false", func() {
cfg := &ModelConfig{Name: "gemma3", Options: []string{"swa_full:false"}}
ApplySWAFullDefault(cfg, 1024)
Expect(cfg.Options).To(Equal([]string{"swa_full:false"}))
})
It("preserves an explicit swa_full:true without duplicating it", func() {
cfg := &ModelConfig{Name: "gemma3", Options: []string{"swa_full:true"}}
ApplySWAFullDefault(cfg, 1024)
Expect(cfg.Options).To(Equal([]string{"swa_full:true"}))
})
It("respects the n_swa alias", func() {
cfg := &ModelConfig{Name: "gemma3", Options: []string{"n_swa:512"}}
ApplySWAFullDefault(cfg, 1024)
Expect(cfg.Options).To(Equal([]string{"n_swa:512"}))
})
It("preserves unrelated options already on the config", func() {
cfg := &ModelConfig{
Name: "gemma3",
Options: []string{"use_jinja:true", "cache_reuse:256"},
}
ApplySWAFullDefault(cfg, 1024)
Expect(cfg.Options).To(Equal([]string{
"use_jinja:true",
"cache_reuse:256",
"swa_full:true",
}))
})
It("tolerates a nil config", func() {
Expect(func() { ApplySWAFullDefault(nil, 1024) }).ToNot(Panic())
})
})
})

View File

@@ -15,14 +15,35 @@ import (
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/pkg/downloader"
"github.com/mudler/LocalAI/pkg/system"
"github.com/mudler/LocalAI/pkg/utils"
"github.com/mudler/LocalAI/pkg/xsync"
"github.com/mudler/xlog"
"gopkg.in/yaml.v3"
)
// validateGalleryConfigURL guards the gallery config fetch against SSRF. A
// gallery config URL can be attacker-controlled (e.g. POST /models/apply with
// an empty id fetches it directly), so a plain http(s) URL must not be allowed
// to reach private, loopback, link-local or cloud-metadata addresses. Other
// schemes (huggingface://, github:, oci://, ollama://, file://) resolve to
// fixed public services or local files and are not a network-SSRF vector, so
// they are left untouched.
// See https://github.com/mudler/LocalAI/issues/10665
func validateGalleryConfigURL(rawURL string) error {
lower := strings.ToLower(strings.TrimSpace(rawURL))
if strings.HasPrefix(lower, "http://") || strings.HasPrefix(lower, "https://") {
return utils.ValidateExternalURL(rawURL)
}
return nil
}
func GetGalleryConfigFromURL[T any](url string, basePath string) (T, error) {
var config T
if err := validateGalleryConfigURL(url); err != nil {
xlog.Error("refusing to fetch gallery config", "error", err, "url", url)
return config, err
}
uri := downloader.URI(url)
err := uri.ReadWithCallback(basePath, func(url string, d []byte) error {
return yaml.Unmarshal(d, &config)
@@ -36,6 +57,10 @@ func GetGalleryConfigFromURL[T any](url string, basePath string) (T, error) {
func GetGalleryConfigFromURLWithContext[T any](ctx context.Context, url string, basePath string) (T, error) {
var config T
if err := validateGalleryConfigURL(url); err != nil {
xlog.Error("refusing to fetch gallery config", "error", err, "url", url)
return config, err
}
uri := downloader.URI(url)
err := uri.ReadWithAuthorizationAndCallback(ctx, basePath, "", func(url string, d []byte) error {
return yaml.Unmarshal(d, &config)

View File

@@ -1,6 +1,10 @@
package gallery_test
import (
"context"
"net/http"
"net/http/httptest"
. "github.com/mudler/LocalAI/core/gallery"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
@@ -19,4 +23,49 @@ var _ = Describe("Gallery API tests", func() {
Expect(e.Name).To(Equal("gpt4all-j"))
})
})
// SSRF guard: a user-supplied gallery config URL (e.g. POST /models/apply
// with an empty id) must not be able to reach internal network addresses.
// See https://github.com/mudler/LocalAI/issues/10665
Context("SSRF protection on config URLs", func() {
var server *httptest.Server
BeforeEach(func() {
// A reachable internal server that would happily serve a valid
// gallery config. Without the SSRF guard the fetch succeeds; the
// guard must block it before the request ever leaves the process.
server = httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(http.StatusOK)
_, _ = w.Write([]byte("name: internal-ssrf\nfiles: []\n"))
}))
})
AfterEach(func() {
server.Close()
})
It("blocks fetching a config from a loopback address", func() {
_, err := GetGalleryConfigFromURL[ModelConfig](server.URL, "")
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("not allowed"))
})
It("blocks fetching a config from a loopback address (context variant)", func() {
_, err := GetGalleryConfigFromURLWithContext[ModelConfig](context.Background(), server.URL, "")
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("not allowed"))
})
It("blocks well-known internal hostnames and metadata endpoints", func() {
for _, u := range []string{
"http://localhost/secret",
"http://10.0.0.1/config.yaml",
"http://192.168.1.1/config.yaml",
"http://169.254.169.254/latest/meta-data/",
} {
_, err := GetGalleryConfigFromURL[ModelConfig](u, "")
Expect(err).To(HaveOccurred(), "expected %s to be rejected", u)
}
})
})
})

View File

@@ -65,6 +65,10 @@ type BackendEndpointService struct {
type GalleryBackend struct {
ID string `json:"id"`
// Force reinstalls the backend even when it is already installed and
// runnable. Off by default so apply stays idempotent for supervising
// apps that ensure their backend on every boot.
Force bool `json:"force"`
}
func CreateBackendEndpointService(galleries []config.Gallery, systemState *system.SystemState, backendApplier *galleryop.GalleryService, upgradeChecker UpgradeInfoProvider) BackendEndpointService {
@@ -103,7 +107,9 @@ func (mgs *BackendEndpointService) GetAllStatusEndpoint() echo.HandlerFunc {
}
}
// ApplyBackendEndpoint installs a new backend to a LocalAI instance
// ApplyBackendEndpoint installs a new backend to a LocalAI instance. The op is
// idempotent: an already-installed, runnable backend is left alone unless the
// request sets "force": true (explicit reinstall).
// @Summary Install backends to LocalAI.
// @Tags backends
// @Param request body GalleryBackend true "query params"
@@ -137,6 +143,7 @@ func (mgs *BackendEndpointService) ApplyBackendEndpoint(systemState *system.Syst
ID: uuid.String(),
GalleryElementName: input.ID,
Galleries: mgs.galleries,
Force: input.Force,
}
return c.JSON(200, schema.BackendResponse{ID: uuid.String(), StatusURL: fmt.Sprintf("%sbackends/jobs/%s", middleware.BaseURL(c), uuid.String())})

View File

@@ -0,0 +1,87 @@
package localai_test
import (
"net/http"
"net/http/httptest"
"os"
"strings"
"github.com/labstack/echo/v4"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/gallery"
. "github.com/mudler/LocalAI/core/http/endpoints/localai"
"github.com/mudler/LocalAI/core/services/galleryop"
"github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/LocalAI/pkg/system"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
// POST /backends/apply must be idempotent by default: supervising apps call it
// on every boot to ensure a backend exists, and forcing a reinstall there
// re-downloads the whole artifact each time. Reinstall stays available behind
// the explicit force flag.
var _ = Describe("POST /backends/apply force plumbing", func() {
var (
app *echo.Echo
gs *galleryop.GalleryService
tmpDir string
received chan galleryop.ManagementOp[gallery.GalleryBackend, any]
)
BeforeEach(func() {
app = echo.New()
var err error
tmpDir, err = os.MkdirTemp("", "backends-apply-test-*")
Expect(err).NotTo(HaveOccurred())
systemState, err := system.GetSystemState(system.WithBackendPath(tmpDir))
Expect(err).NotTo(HaveOccurred())
appConfig := &config.ApplicationConfig{SystemState: systemState}
// The service is deliberately not started: the test reads the op off
// the (unbuffered) channel itself.
gs = galleryop.NewGalleryService(appConfig, model.NewModelLoader(systemState))
svc := CreateBackendEndpointService(nil, systemState, gs, nil)
app.POST("/backends/apply", svc.ApplyBackendEndpoint(systemState))
received = make(chan galleryop.ManagementOp[gallery.GalleryBackend, any], 1)
go func() {
op := <-gs.BackendGalleryChannel
received <- op
}()
})
AfterEach(func() {
Expect(os.RemoveAll(tmpDir)).To(Succeed())
})
apply := func(body string) *httptest.ResponseRecorder {
req := httptest.NewRequest(http.MethodPost, "/backends/apply", strings.NewReader(body))
req.Header.Set(echo.HeaderContentType, echo.MIMEApplicationJSON)
rec := httptest.NewRecorder()
app.ServeHTTP(rec, req)
return rec
}
It("enqueues a non-forced op by default", func() {
rec := apply(`{"id":"llama-cpp"}`)
Expect(rec.Code).To(Equal(http.StatusOK))
var op galleryop.ManagementOp[gallery.GalleryBackend, any]
Eventually(received).Should(Receive(&op))
Expect(op.GalleryElementName).To(Equal("llama-cpp"))
Expect(op.Force).To(BeFalse())
})
It("enqueues a forced op when the request sets force", func() {
rec := apply(`{"id":"llama-cpp","force":true}`)
Expect(rec.Code).To(Equal(http.StatusOK))
var op galleryop.ManagementOp[gallery.GalleryBackend, any]
Eventually(received).Should(Receive(&op))
Expect(op.GalleryElementName).To(Equal("llama-cpp"))
Expect(op.Force).To(BeTrue())
})
})

View File

@@ -0,0 +1,54 @@
package localai
import (
"net/http"
"github.com/labstack/echo/v4"
"github.com/mudler/LocalAI/core/backend"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/schema"
"github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/xlog"
)
// LoadModelEndpoint pre-loads a model into memory by name — the inverse of
// /backend/shutdown. For a realtime pipeline model every configured sub-model
// (VAD, transcription, LLM, TTS, sound_detection, voice_recognition) is loaded; for a regular
// model its own backend is loaded. The call blocks until loading finishes so
// clients can drive warm-up explicitly and learn up front whether a model
// fails to load.
// @Summary Pre-load a model into memory
// @Description Loads the named model (or, for a realtime pipeline, all of its sub-models) into memory so subsequent requests pay no cold-start cost. The inverse of /backend/shutdown.
// @Tags monitoring
// @Accept json
// @Produce json
// @Param request body schema.ModelLoadRequest true "Model to load"
// @Success 200 {object} schema.ModelLoadResponse "Model loaded"
// @Failure 400 {object} schema.ModelLoadResponse "Missing model name"
// @Failure 500 {object} schema.ModelLoadResponse "Load failed (Loaded lists any sub-models that did load)"
// @Router /backend/load [post]
func LoadModelEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) echo.HandlerFunc {
return func(c echo.Context) error {
input := new(schema.ModelLoadRequest)
if err := c.Bind(input); err != nil {
return err
}
if input.Model == "" {
return c.JSON(http.StatusBadRequest, schema.ModelLoadResponse{Message: "model is required"})
}
loaded, err := backend.PreloadModelByName(c.Request().Context(), cl, ml, appConfig, input.Model)
if err != nil {
xlog.Error("failed to pre-load model", "model", input.Model, "loaded", loaded, "error", err)
return c.JSON(http.StatusInternalServerError, schema.ModelLoadResponse{
Loaded: loaded,
Message: "failed to load model: " + err.Error(),
})
}
return c.JSON(http.StatusOK, schema.ModelLoadResponse{
Loaded: loaded,
Message: "model loaded",
})
}
}

View File

@@ -0,0 +1,102 @@
package localai_test
import (
"bytes"
"encoding/json"
"net/http"
"net/http/httptest"
"os"
"path/filepath"
"github.com/labstack/echo/v4"
"github.com/mudler/LocalAI/core/config"
. "github.com/mudler/LocalAI/core/http/endpoints/localai"
"github.com/mudler/LocalAI/core/schema"
"github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/LocalAI/pkg/system"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
var _ = Describe("LoadModelEndpoint (/backend/load)", func() {
var (
app *echo.Echo
tempDir string
configLoader *config.ModelConfigLoader
modelLoader *model.ModelLoader
appConfig *config.ApplicationConfig
)
post := func(body string) *httptest.ResponseRecorder {
req := httptest.NewRequest(http.MethodPost, "/backend/load", bytes.NewBufferString(body))
req.Header.Set(echo.HeaderContentType, echo.MIMEApplicationJSON)
rec := httptest.NewRecorder()
app.ServeHTTP(rec, req)
return rec
}
decode := func(rec *httptest.ResponseRecorder) schema.ModelLoadResponse {
var resp schema.ModelLoadResponse
Expect(json.Unmarshal(rec.Body.Bytes(), &resp)).To(Succeed())
return resp
}
writeConfig := func(name, contents string) {
Expect(os.WriteFile(filepath.Join(tempDir, name+".yaml"), []byte(contents), 0o600)).To(Succeed())
}
BeforeEach(func() {
var err error
tempDir, err = os.MkdirTemp("", "backend-load-test-*")
Expect(err).NotTo(HaveOccurred())
systemState, err := system.GetSystemState(system.WithModelPath(tempDir))
Expect(err).NotTo(HaveOccurred())
appConfig = config.NewApplicationConfig(config.WithSystemState(systemState))
configLoader = config.NewModelConfigLoader(tempDir)
modelLoader = model.NewModelLoader(systemState) // no backends installed
app = echo.New()
app.POST("/backend/load", LoadModelEndpoint(configLoader, modelLoader, appConfig))
})
AfterEach(func() {
_ = os.RemoveAll(tempDir)
})
It("rejects a request with no model name", func() {
rec := post(`{}`)
Expect(rec.Code).To(Equal(http.StatusBadRequest))
Expect(decode(rec).Message).To(ContainSubstring("model is required"))
})
It("reports a load failure for a regular model with nothing loaded", func() {
writeConfig("solo", "name: solo\n")
rec := post(`{"model":"solo"}`)
Expect(rec.Code).To(Equal(http.StatusInternalServerError))
resp := decode(rec)
Expect(resp.Loaded).To(BeEmpty())
Expect(resp.Message).To(ContainSubstring("failed to load model"))
})
It("expands a pipeline model and reports each sub-model that failed to load", func() {
writeConfig("voicebot", "name: voicebot\npipeline:\n vad: vad-m\n transcription: stt-m\n llm: llm-m\n tts: tts-m\n")
writeConfig("vad-m", "name: vad-m\n")
writeConfig("stt-m", "name: stt-m\n")
writeConfig("llm-m", "name: llm-m\n")
writeConfig("tts-m", "name: tts-m\n")
rec := post(`{"model":"voicebot"}`)
Expect(rec.Code).To(Equal(http.StatusInternalServerError))
resp := decode(rec)
Expect(resp.Message).To(ContainSubstring("failed to load model"))
// The pipeline stub itself is never loaded; its sub-models are what the
// endpoint tries, so the error names them rather than "voicebot".
Expect(resp.Message).To(ContainSubstring("vad-m"))
Expect(resp.Message).ToNot(ContainSubstring("voicebot"))
})
})

View File

@@ -51,6 +51,9 @@ func (stubClient) EditModelConfig(_ context.Context, _ string, _ map[string]any)
return nil
}
func (stubClient) ReloadModels(_ context.Context) error { return nil }
func (stubClient) LoadModel(_ context.Context, model string) ([]string, error) {
return []string{model}, nil
}
func (stubClient) SetAlias(_ context.Context, _, _ string) error {
return nil
}

View File

@@ -7,6 +7,7 @@ import (
"encoding/binary"
"encoding/hex"
"encoding/json"
"errors"
"fmt"
"math"
"os"
@@ -266,6 +267,12 @@ type Model interface {
// grpcerrors.IsLiveTranscriptionUnsupported.
TranscribeLive(ctx context.Context, language string, onEvent func(backend.LiveTranscriptionEvent)) (backend.LiveTranscriptionSession, error)
PredictConfig() *config.ModelConfig
// Warmup eagerly loads the pipeline's sub-model backends into memory so the
// first realtime turn doesn't pay each backend's cold-start load cost. Loads
// run concurrently; Warmup blocks until they all finish and returns a joined
// error naming every stage that failed to load (nil if all succeeded), so a
// caller can surface model-load failures at session start instead of mid-call.
Warmup(ctx context.Context) error
}
var upgrader = websocket.Upgrader{
@@ -583,18 +590,8 @@ func runRealtimeSession(application *application.Application, t Transport, model
}
session.ModelInterface = m
if session.SummaryModel != "" {
summaryModelName := session.SummaryModel
sid := sessionID
session.summarizerFactory = func() (Model, error) {
summaryCfg, lerr := application.ModelConfigLoader().LoadModelConfigFileByNameDefaultOptions(summaryModelName, application.ApplicationConfig())
if lerr != nil {
return nil, fmt.Errorf("load summary model config %q: %w", summaryModelName, lerr)
}
return newModel(&summaryCfg.Pipeline, application.ModelConfigLoader(), application.ModelLoader(), application.ApplicationConfig(), evaluator, buildRealtimeRoutingContext(application, sid))
}
}
// The voice gate is built before the warm-up below so its
// speaker-recognition model can warm alongside the pipeline stages.
if cfg.Pipeline.VoiceGateEnabled() {
gate, gerr := newVoiceGate(
*cfg.Pipeline.VoiceRecognition,
@@ -612,6 +609,47 @@ func runRealtimeSession(application *application.Application, t Transport, model
xlog.Info("realtime voice recognition gate enabled", "mode", gate.cfg.Mode, "when", gate.cfg.When)
}
// Warm the pipeline's sub-model backends before announcing the session.
// Loads run concurrently but we block here until they all finish, so a model
// that fails to load (missing weights, bad backend, OOM) surfaces as an error
// at session start rather than stalling — or failing — mid-call on the first
// turn (VAD on the first audio chunk, STT at end-of-speech, LLM on the first
// reply, TTS on the first spoken output). On success the backends are already
// resident, so the first turn pays no cold-start cost. Opt out per pipeline
// with `pipeline.disable_warmup: true` to restore lazy load-on-first-use
// (errors then surface on first use instead of at session start).
if !cfg.Pipeline.DisableWarmup {
warmErr := make(chan error, 1)
go func() { warmErr <- m.Warmup(context.Background()) }()
// The voice-gate model warms concurrently with the pipeline stages: an
// enforced gate blocks each utterance on speaker resolution, so its
// cold-start would otherwise land on the first turn too. (Compaction's
// summary_model stays lazy — it only runs off the response path.)
var gateErr error
if session.voiceGate != nil {
_, gateErr = backend.PreloadStages(context.Background(), application.ModelLoader(), application.ApplicationConfig(), []backend.PreloadStage{
{Role: "voice_recognition", Cfg: session.voiceGate.recCfg},
})
}
if err := errors.Join(<-warmErr, gateErr); err != nil {
xlog.Error("realtime warmup failed", "model", model, "error", err)
sendError(t, "model_load_error", "Failed to load pipeline models: "+err.Error(), "", "")
return
}
}
if session.SummaryModel != "" {
summaryModelName := session.SummaryModel
sid := sessionID
session.summarizerFactory = func() (Model, error) {
summaryCfg, lerr := application.ModelConfigLoader().LoadModelConfigFileByNameDefaultOptions(summaryModelName, application.ApplicationConfig())
if lerr != nil {
return nil, fmt.Errorf("load summary model config %q: %w", summaryModelName, lerr)
}
return newModel(&summaryCfg.Pipeline, application.ModelConfigLoader(), application.ModelLoader(), application.ApplicationConfig(), evaluator, buildRealtimeRoutingContext(application, sid))
}
}
// Store the session and notify the transport (for WebRTC audio track handling)
sessionLock.Lock()
sessions[sessionID] = session
@@ -1125,6 +1163,21 @@ func updateSession(session *Session, update *types.SessionUnion, cl *config.Mode
return err
}
session.ModelInterface = m
// A session.update that swaps the model/voice rebuilds the pipeline, so
// warm the new backends too (unless opted out) — otherwise the next turn
// pays the cold-start load the original session warm-up already avoided.
// Unlike session start this stays non-blocking: updateSession runs under
// the global sessionLock, so blocking on a multi-second load here would
// stall every other session. Load errors are logged (and still surface on
// first use); per-stage failures are already warned inside
// backend.PreloadStages.
if !session.ModelConfig.Pipeline.DisableWarmup {
go func() {
if err := m.Warmup(context.Background()); err != nil {
xlog.Error("realtime warmup failed after session.update", "error", err)
}
}()
}
}
if rt.Audio != nil && rt.Audio.Input != nil && rt.Audio.Input.TurnDetectionSet {

View File

@@ -174,6 +174,8 @@ func (m *fakeModel) TranscribeLive(_ context.Context, _ string, onEvent func(bac
func (m *fakeModel) PredictConfig() *config.ModelConfig { return m.cfg }
func (m *fakeModel) Warmup(ctx context.Context) error { return nil }
// fakeLiveSession records what semantic_vad fed and closed; closeEvents are
// replayed through onEvent during Close, mimicking the backend's finalize
// flush (trailing delta + Final) landing before Close returns.

View File

@@ -110,6 +110,15 @@ func (m *transcriptOnlyModel) PredictConfig() *config.ModelConfig {
return nil
}
func (m *transcriptOnlyModel) Warmup(ctx context.Context) error {
_, err := backend.PreloadStages(ctx, m.modelLoader, m.appConfig, []backend.PreloadStage{
{Role: "vad", Cfg: m.VADConfig},
{Role: "transcription", Cfg: m.TranscriptionConfig},
{Role: "sound_detection", Cfg: m.SoundDetectionConfig},
})
return err
}
func (m *wrappedModel) VAD(ctx context.Context, request *schema.VADRequest) (*schema.VADResponse, error) {
return backend.VAD(request, ctx, m.modelLoader, m.appConfig, *m.VADConfig)
}
@@ -360,6 +369,17 @@ func (m *wrappedModel) PredictConfig() *config.ModelConfig {
return m.LLMConfig
}
func (m *wrappedModel) Warmup(ctx context.Context) error {
_, err := backend.PreloadStages(ctx, m.modelLoader, m.appConfig, []backend.PreloadStage{
{Role: "vad", Cfg: m.VADConfig},
{Role: "transcription", Cfg: m.TranscriptionConfig},
{Role: "llm", Cfg: m.LLMConfig},
{Role: "tts", Cfg: m.TTSConfig},
{Role: "sound_detection", Cfg: m.SoundDetectionConfig},
})
return err
}
// wavStreamHeaderBytes is the size of the WAV header that backend.ModelTTSStream
// emits as its first audio callback; the sample rate lives at byte offset 24.
const wavStreamHeaderBytes = 44
@@ -440,7 +460,7 @@ func loadSoundDetectionConfig(pipeline *config.Pipeline, cl *config.ModelConfigL
if pipeline.SoundDetection == "" {
return nil, nil
}
cfg, err := loadPipelineSubModel(cl, pipeline.SoundDetection, ml.ModelPath)
cfg, err := cl.LoadResolvedModelConfig(pipeline.SoundDetection, ml.ModelPath)
if err != nil {
return nil, fmt.Errorf("failed to load sound detection config: %w", err)
}
@@ -451,7 +471,7 @@ func loadSoundDetectionConfig(pipeline *config.Pipeline, cl *config.ModelConfigL
}
func newTranscriptionOnlyModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) (Model, *config.ModelConfig, error) {
cfgVAD, err := loadPipelineSubModel(cl, pipeline.VAD, ml.ModelPath)
cfgVAD, err := cl.LoadResolvedModelConfig(pipeline.VAD, ml.ModelPath)
if err != nil {
return nil, nil, fmt.Errorf("failed to load backend config: %w", err)
@@ -461,7 +481,7 @@ func newTranscriptionOnlyModel(pipeline *config.Pipeline, cl *config.ModelConfig
return nil, nil, fmt.Errorf("failed to validate config: %w", err)
}
cfgSST, err := loadPipelineSubModel(cl, pipeline.Transcription, ml.ModelPath)
cfgSST, err := cl.LoadResolvedModelConfig(pipeline.Transcription, ml.ModelPath)
if err != nil {
return nil, nil, fmt.Errorf("failed to load backend config: %w", err)
@@ -550,30 +570,11 @@ func buildRealtimeRoutingContext(a *application.Application, sessionID string) *
}
}
// loadPipelineSubModel loads a pipeline sub-model config by name and follows a
// single alias hop, so a pipeline that references an alias (e.g. `llm: default`)
// gets the alias target's full config (Backend, Model, ...) rather than the
// alias stub with an empty Backend. Without this the alias survives unresolved
// into model loading and fails downstream — notably in distributed mode with
// "backend name is empty". Mirrors the top-level alias resolution in
// core/http/middleware/request.go.
func loadPipelineSubModel(cl *config.ModelConfigLoader, name, modelPath string) (*config.ModelConfig, error) {
cfg, err := cl.LoadModelConfigFileByName(name, modelPath)
if err != nil {
return nil, err
}
resolved, _, err := cl.ResolveAlias(cfg)
if err != nil {
return nil, err
}
return resolved, nil
}
// returns and loads either a wrapped model or a model that support audio-to-audio
func newModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig, evaluator *templates.Evaluator, routing *RealtimeRoutingContext) (Model, error) {
xlog.Debug("Creating new model pipeline model", "pipeline", pipeline)
cfgVAD, err := loadPipelineSubModel(cl, pipeline.VAD, ml.ModelPath)
cfgVAD, err := cl.LoadResolvedModelConfig(pipeline.VAD, ml.ModelPath)
if err != nil {
return nil, fmt.Errorf("failed to load backend config: %w", err)
@@ -584,7 +585,7 @@ func newModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model
}
// TODO: Do we always need a transcription model? It can be disabled. Note that any-to-any instruction following models don't transcribe as such, so if transcription is required it is a separate process
cfgSST, err := loadPipelineSubModel(cl, pipeline.Transcription, ml.ModelPath)
cfgSST, err := cl.LoadResolvedModelConfig(pipeline.Transcription, ml.ModelPath)
if err != nil {
return nil, fmt.Errorf("failed to load backend config: %w", err)
@@ -616,7 +617,7 @@ func newModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model
xlog.Debug("Loading a wrapped model")
// Otherwise we want to return a wrapped model, which is a "virtual" model that re-uses other models to perform operations
cfgLLM, err := loadPipelineSubModel(cl, pipeline.LLM, ml.ModelPath)
cfgLLM, err := cl.LoadResolvedModelConfig(pipeline.LLM, ml.ModelPath)
if err != nil {
return nil, fmt.Errorf("failed to load backend config: %w", err)
@@ -631,7 +632,7 @@ func newModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model
applyPipelineReasoning(cfgLLM, *pipeline)
applyPipelineThinking(cfgLLM, *pipeline)
cfgTTS, err := loadPipelineSubModel(cl, pipeline.TTS, ml.ModelPath)
cfgTTS, err := cl.LoadResolvedModelConfig(pipeline.TTS, ml.ModelPath)
if err != nil {
return nil, fmt.Errorf("failed to load backend config: %w", err)

View File

@@ -21,6 +21,7 @@ type namedEmbedding struct {
// drive the realtime pipeline.
type voiceGate struct {
cfg config.PipelineVoiceRecognition // normalized
recCfg *config.ModelConfig // resolved speaker-recognition model, for warm-up
registry voicerecognition.Registry // identify mode (nil otherwise)
refEmbeds []namedEmbedding // verify mode, pre-embedded refs
refAudios []config.VoiceReference // verify + anti-spoofing: ref paths
@@ -72,7 +73,9 @@ func newVoiceGate(
return nil, err
}
recCfg, err := cl.LoadModelConfigFileByName(cfg.Model, ml.ModelPath)
// Resolved like every other pipeline sub-model (one alias hop), so an
// aliased voice_recognition model gets its target's backend.
recCfg, err := cl.LoadResolvedModelConfig(cfg.Model, ml.ModelPath)
if err != nil {
return nil, fmt.Errorf("voice_recognition: failed to load model %q: %w", cfg.Model, err)
}
@@ -82,6 +85,7 @@ func newVoiceGate(
g := &voiceGate{
cfg: cfg,
recCfg: recCfg,
registry: registry,
embedFn: func(ctx context.Context, wavPath string) ([]float32, error) {
res, err := backend.VoiceEmbed(ctx, wavPath, ml, appConfig, *recCfg)

View File

@@ -0,0 +1,64 @@
package openai
import (
"context"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/LocalAI/pkg/system"
)
// Warmup delegates to backend.PreloadStages (its concurrency, nil-skipping and
// error-joining semantics are pinned in core/backend). These specs pin the
// wiring instead: each realtime model type must warm exactly its configured
// stages under the right pipeline-role labels. No backends are installed, so
// every attempted stage fails to load — the joined error is the proof of which
// stages were attempted and how they were labeled.
var _ = Describe("realtime model Warmup wiring", func() {
newLoader := func() (*model.ModelLoader, *config.ApplicationConfig) {
systemState, err := system.GetSystemState(system.WithModelPath(GinkgoT().TempDir()))
Expect(err).ToNot(HaveOccurred())
appConfig := config.NewApplicationConfig(config.WithSystemState(systemState))
return model.NewModelLoader(systemState), appConfig
}
It("wrappedModel warms every configured stage under its pipeline role", func() {
ml, appConfig := newLoader()
m := &wrappedModel{
VADConfig: &config.ModelConfig{Name: "vad-m"},
TranscriptionConfig: &config.ModelConfig{Name: "stt-m"},
LLMConfig: &config.ModelConfig{Name: "llm-m"},
TTSConfig: &config.ModelConfig{Name: "tts-m"},
SoundDetectionConfig: &config.ModelConfig{Name: "ced-m"},
modelLoader: ml,
appConfig: appConfig,
}
err := m.Warmup(context.Background())
Expect(err).To(HaveOccurred())
for _, stage := range []string{"vad (vad-m)", "transcription (stt-m)", "llm (llm-m)", "tts (tts-m)", "sound_detection (ced-m)"} {
Expect(err.Error()).To(ContainSubstring(stage))
}
})
It("transcriptOnlyModel warms its stages and skips absent ones", func() {
ml, appConfig := newLoader()
m := &transcriptOnlyModel{
VADConfig: &config.ModelConfig{Name: "vad-m"},
TranscriptionConfig: &config.ModelConfig{Name: "stt-m"},
// SoundDetectionConfig nil: an absent stage must be skipped, not
// fail the warm-up.
modelLoader: ml,
appConfig: appConfig,
}
err := m.Warmup(context.Background())
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("vad (vad-m)"))
Expect(err.Error()).To(ContainSubstring("transcription (stt-m)"))
Expect(err.Error()).ToNot(ContainSubstring("sound_detection"))
})
})

View File

@@ -7,6 +7,7 @@ import (
"io"
"net/http"
"os"
"path/filepath"
"strings"
"time"
@@ -29,6 +30,8 @@ const testModel = "Qwen3-VL-2B-Instruct-Q4_K_M"
var _ = Describe("Open Responses API", func() {
var app *echo.Echo
var localApp *application.Application
var localModelDir string
var c context.Context
var cancel context.CancelFunc
@@ -38,28 +41,47 @@ var _ = Describe("Open Responses API", func() {
Context("API with ephemeral models", func() {
BeforeEach(func(sc SpecContext) {
var err error
// This suite exercises the /v1/responses HTTP/protocol contract
// (Content-Type, SSE framing, response envelope, error shapes),
// not real inference — so it runs against the same prebuilt
// mock-backend the rest of the http suite uses instead of
// downloading a real model. Skip cleanly when it isn't built.
if mockBackendPath == "" {
Skip("mock-backend binary not built; run 'make build-mock-backend'")
}
backendPath := os.Getenv("BACKENDS_PATH")
var err error
c, cancel = context.WithCancel(context.Background())
// Isolated model dir carrying a single config named after testModel
// but served by the mock backend, so the responses endpoint can
// resolve and load the model without any real backend build.
localModelDir, err = os.MkdirTemp("", "openresponses-models-")
Expect(err).ToNot(HaveOccurred())
mockModelYAML := "name: " + testModel + "\n" +
"backend: mock-backend\n" +
"parameters:\n" +
" model: mock-model.bin\n"
Expect(os.WriteFile(filepath.Join(localModelDir, testModel+".yaml"), []byte(mockModelYAML), 0644)).To(Succeed())
systemState, err := system.GetSystemState(
system.WithBackendPath(backendPath),
system.WithModelPath(modelDir),
system.WithBackendPath(backendDir),
system.WithModelPath(localModelDir),
)
Expect(err).ToNot(HaveOccurred())
application, err := application.New(
localApp, err = application.New(
append(commonOpts,
config.WithContext(c),
config.WithSystemState(systemState),
config.WithApiKeys([]string{apiKey}),
config.WithModelsURL("https://huggingface.co/unsloth/Qwen3-VL-2B-Instruct-GGUF"),
)...)
Expect(err).ToNot(HaveOccurred())
localApp.ModelLoader().SetExternalBackend("mock-backend", mockBackendPath)
app, err = API(application)
app, err = API(localApp)
Expect(err).ToNot(HaveOccurred())
go func() {
@@ -80,14 +102,24 @@ var _ = Describe("Open Responses API", func() {
})
AfterEach(func(sc SpecContext) {
// Synchronous app shutdown first — context-cancel cleanup is async
// and races test-binary exit, orphaning mock-backend children.
if localApp != nil {
_ = localApp.Shutdown()
localApp = nil
}
cancel()
if app != nil {
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
err := app.Shutdown(ctx)
Expect(err).ToNot(HaveOccurred())
app = nil
}
if localModelDir != "" {
_ = os.RemoveAll(localModelDir)
localModelDir = ""
}
})
Context("HTTP Protocol Compliance", func() {
@@ -969,13 +1001,16 @@ var _ = Describe("Open Responses API", func() {
Expect(ok).To(BeTrue())
Expect(itemID).ToNot(BeEmpty())
// Now create a new response with item_reference
// Now create a new response with item_reference. Per the OpenAI
// Responses spec (and this server's parser in
// endpoints/openresponses/responses.go) an item_reference carries
// the referenced item in the "id" field, not "item_id".
reqBody2 := map[string]any{
"model": testModel,
"input": []any{
map[string]any{
"type": "item_reference",
"item_id": itemID,
"type": "item_reference",
"id": itemID,
},
map[string]any{
"type": "message",
@@ -1005,8 +1040,8 @@ var _ = Describe("Open Responses API", func() {
"model": testModel,
"input": []any{
map[string]any{
"type": "item_reference",
"item_id": "nonexistent_item_id",
"type": "item_reference",
"id": "nonexistent_item_id",
},
},
}

View File

@@ -0,0 +1,133 @@
import { test, expect } from './coverage-fixtures.js'
// Seeds two-message chat into localStorage so we don't need a live model.
async function seedChat(page, history) {
await page.addInitScript((h) => {
const chat = {
id: 'seed1', name: 'Seeded Chat', model: 'test-model',
history: h, systemPrompt: '', mcpMode: false, mcpServers: [],
clientMCPServers: [], temperature: null, topP: null, topK: null,
tokenUsage: { prompt: 0, completion: 0, total: 0 },
contextSize: null, createdAt: Date.now(), updatedAt: Date.now(),
}
localStorage.setItem('localai_chats_data', JSON.stringify({
chats: [chat], activeChatId: 'seed1', lastSaved: Date.now(),
}))
}, history)
}
async function mockModels(page) {
await page.route('**/api/models/capabilities', (route) => route.fulfill({
contentType: 'application/json',
body: JSON.stringify({ data: [{ id: 'test-model', capabilities: ['FLAG_CHAT'] }] }),
}))
await page.route('**/api/operations', (route) => route.fulfill({
contentType: 'application/json', body: JSON.stringify({ operations: [] }),
}))
}
const TWO_TURNS = [
{ role: 'user', content: 'first question' },
{ role: 'assistant', content: 'first answer' },
{ role: 'user', content: 'second question' },
{ role: 'assistant', content: 'second answer' },
]
test('duplicate creates an independent copy and switches to it', async ({ page }) => {
await mockModels(page)
await seedChat(page, TWO_TURNS)
await page.goto('/app/chat')
// Open the chats menu (Ctrl/Cmd+K) and duplicate the seeded chat.
// Wait for the menu trigger to mount so its global keydown listener is armed
// before we dispatch the shortcut.
await page.getByTitle('Conversations (Ctrl/Cmd+K)').waitFor()
await page.keyboard.press('Control+k')
await page.getByTitle('Duplicate chat').first().click()
// A new active chat named "Seeded Chat (fork)" with the same 4 messages.
await expect(page.locator('.chat-header-title')).toHaveText('Seeded Chat (fork)')
await expect(page.locator('.chat-message-user')).toHaveCount(2)
await expect(page.locator('.chat-message-assistant')).toHaveCount(2)
})
async function mockCompletion(page, replyText) {
await page.route('**/v1/chat/completions', (route) => {
const sse =
`data: ${JSON.stringify({ choices: [{ delta: { content: replyText } }] })}\n\n` +
`data: ${JSON.stringify({ choices: [{ delta: {}, finish_reason: 'stop' }], usage: { prompt_tokens: 1, completion_tokens: 1, total_tokens: 2 } })}\n\n` +
`data: [DONE]\n\n`
route.fulfill({ status: 200, contentType: 'text/event-stream', body: sse })
})
}
test('retry regenerates the first answer and drops the later turn', async ({ page }) => {
await mockModels(page)
// Capture the outbound request body so we can assert the model receives the
// truncated history (not the stale downstream turns).
let sentMessages = null
await page.route('**/v1/chat/completions', (route) => {
sentMessages = route.request().postDataJSON()?.messages || []
const sse =
`data: ${JSON.stringify({ choices: [{ delta: { content: 'REGENERATED first answer' } }] })}\n\n` +
`data: ${JSON.stringify({ choices: [{ delta: {}, finish_reason: 'stop' }], usage: { prompt_tokens: 1, completion_tokens: 1, total_tokens: 2 } })}\n\n` +
`data: [DONE]\n\n`
route.fulfill({ status: 200, contentType: 'text/event-stream', body: sse })
})
await seedChat(page, TWO_TURNS)
await page.goto('/app/chat')
// Hover the FIRST assistant message and click its retry button.
const firstAssistant = page.locator('.chat-message-assistant').first()
await firstAssistant.hover()
await firstAssistant.getByTitle('Regenerate').click()
// History is truncated to the first user turn, then the new answer streams in;
// the second Q/A turn is gone.
await expect(page.locator('.chat-message-assistant')).toContainText(['REGENERATED first answer'])
await expect(page.locator('.chat-message-user')).toHaveCount(1)
await expect(page.locator('.chat-message-assistant')).toHaveCount(1)
// The OUTBOUND payload must also be truncated: the resent user turn is present,
// but the downstream turn and the stale first answer must be gone.
const contents = (sentMessages || []).map(m =>
typeof m.content === 'string' ? m.content : JSON.stringify(m.content)
)
expect(contents.join('\n')).toContain('first question')
expect(contents.join('\n')).not.toContain('second question')
expect(contents.join('\n')).not.toContain('first answer')
})
test('copy chat puts the whole conversation on the clipboard', async ({ page, context }) => {
await context.grantPermissions(['clipboard-read', 'clipboard-write'])
await mockModels(page)
await seedChat(page, TWO_TURNS)
await page.goto('/app/chat')
// Wait for the menu trigger to mount so its global keydown listener is armed
// before we dispatch the shortcut (same mount-race guard as the duplicate test).
await page.getByTitle('Conversations (Ctrl/Cmd+K)').waitFor()
await page.keyboard.press('Control+k')
await page.getByTitle('Copy chat').first().click()
const clip = await page.evaluate(() => navigator.clipboard.readText())
expect(clip).toContain('# Seeded Chat')
expect(clip).toContain('first answer')
expect(clip).toContain('second answer')
})
test('branch from the first answer forks history up to that point', async ({ page }) => {
await mockModels(page)
await seedChat(page, TWO_TURNS)
await page.goto('/app/chat')
const firstAssistant = page.locator('.chat-message-assistant').first()
await firstAssistant.hover()
await firstAssistant.getByTitle('Branch from here').click()
// New active chat "Seeded Chat (fork)" contains only the first Q/A turn.
await expect(page.locator('.chat-header-title')).toHaveText('Seeded Chat (fork)')
await expect(page.locator('.chat-message-user')).toHaveCount(1)
await expect(page.locator('.chat-message-assistant')).toHaveCount(1)
await expect(page.locator('.chat-message-assistant')).toContainText(['first answer'])
})

View File

@@ -72,6 +72,7 @@
"actions": {
"copy": "Copy",
"regenerate": "Regenerate",
"branch": "Branch from here",
"jumpToLatest": "Jump to latest"
},
"streaming": {
@@ -100,7 +101,9 @@
"toasts": {
"selectModel": "Please select a model",
"copied": "Copied to clipboard",
"copyFailed": "Could not copy to clipboard"
"copyFailed": "Could not copy to clipboard",
"chatCopied": "Chat copied to clipboard",
"forked": "Created a new chat"
},
"menu": {
"trigger": "Chats",
@@ -110,6 +113,8 @@
"noMatch": "No conversations match your search",
"noConversations": "No conversations yet",
"rename": "Rename",
"duplicate": "Duplicate chat",
"copyChat": "Copy chat",
"exportMarkdown": "Export as Markdown",
"deleteChat": "Delete chat",
"newChat": "New chat",

View File

@@ -24,6 +24,8 @@ const ChatsMenu = forwardRef(function ChatsMenu({
onDeleteAll,
onRename,
onExport,
onCopyChat,
onDuplicate,
}, ref) {
const { t } = useTranslation('chat')
const [open, setOpen] = useState(false)
@@ -230,6 +232,24 @@ const ChatsMenu = forwardRef(function ChatsMenu({
>
<i className="fas fa-pen" />
</button>
{onDuplicate && (
<button
type="button"
onClick={(e) => { e.stopPropagation(); onDuplicate(chat); setOpen(false) }}
title={t('menu.duplicate')}
>
<i className="fas fa-clone" />
</button>
)}
{(chat.history?.length || 0) > 0 && onCopyChat && (
<button
type="button"
onClick={(e) => { e.stopPropagation(); onCopyChat(chat) }}
title={t('menu.copyChat')}
>
<i className="fas fa-clipboard" />
</button>
)}
{(chat.history?.length || 0) > 0 && onExport && (
<button
type="button"

View File

@@ -141,6 +141,24 @@ export function useChat(initialModel = '') {
return chat
}, [])
const forkChat = useCallback((chatId, uptoIndex) => {
const src = chats.find(c => c.id === chatId)
if (!src) return null
const end = typeof uptoIndex === 'number' ? uptoIndex : src.history.length
const forked = {
...src,
id: generateId(),
name: `${src.name} (fork)`,
history: structuredClone(src.history.slice(0, end)),
tokenUsage: { prompt: 0, completion: 0, total: 0 },
createdAt: Date.now(),
updatedAt: Date.now(),
}
setChats(prev => [forked, ...prev])
setActiveChatId(forked.id)
return forked
}, [chats])
const switchChat = useCallback((chatId) => {
setActiveChatId(chatId)
setStreamingContent('')
@@ -260,8 +278,12 @@ export function useChat(initialModel = '') {
if (chat?.systemPrompt) {
messages.push({ role: 'system', content: chat.systemPrompt })
}
// Filter out thinking/reasoning/tool_call/tool_result messages
const historyForApi = (chat?.history || []).filter(m =>
// Filter out thinking/reasoning/tool_call/tool_result messages.
// options.baseHistory lets callers (e.g. mid-conversation retry) pass the
// intended truncated history synchronously; the closure `chat` still holds
// the stale pre-truncation state because setChats only schedules an update.
const baseHistory = options.baseHistory || chat?.history || []
const historyForApi = baseHistory.filter(m =>
m.role !== 'thinking' && m.role !== 'reasoning' && m.role !== 'tool_call' && m.role !== 'tool_result'
)
messages.push(...historyForApi, { role: 'user', content: messageContent })
@@ -793,6 +815,7 @@ export function useChat(initialModel = '') {
tokensPerSecond,
maxTokensPerSecond,
addChat,
forkChat,
switchChat,
deleteChat,
deleteAllChats,

View File

@@ -33,7 +33,7 @@ function getLastMessagePreview(chat) {
return ''
}
function exportChatAsMarkdown(chat) {
function serializeChatAsMarkdown(chat) {
let md = `# ${chat.name}\n\n`
md += `Model: ${chat.model || 'Unknown'}\n`
md += `Date: ${new Date(chat.createdAt).toLocaleString()}\n\n---\n\n`
@@ -47,7 +47,11 @@ function exportChatAsMarkdown(chat) {
md += `<details><summary>Thinking</summary>\n\n${msg.content}\n\n</details>\n\n`
}
}
const blob = new Blob([md], { type: 'text/markdown' })
return md
}
function downloadChatAsMarkdown(chat) {
const blob = new Blob([serializeChatAsMarkdown(chat)], { type: 'text/markdown' })
const url = URL.createObjectURL(blob)
const a = document.createElement('a')
a.href = url
@@ -294,7 +298,7 @@ export default function Chat() {
const {
chats, activeChat, activeChatId, isStreaming, streamingChatId, streamingContent,
streamingReasoning, streamingToolCalls, tokensPerSecond, maxTokensPerSecond,
addChat, switchChat, deleteChat, deleteAllChats, renameChat, updateChatSettings,
addChat, forkChat, switchChat, deleteChat, deleteAllChats, renameChat, updateChatSettings,
sendMessage, stopGeneration, clearHistory, getContextUsagePercent, addMessage,
} = useChat(urlModel || '')
@@ -795,34 +799,27 @@ export default function Chat() {
await sendMessage(msg, files, mcpOptions)
}, [input, files, activeChat, sendMessage, addToast, getToolsForLLM, isClientTool, executeTool, hasAppUI, getAppResource, getToolDefinition])
const handleRegenerate = useCallback(async () => {
const handleRegenerate = useCallback(async (targetIndex) => {
if (!activeChat || isStreaming) return
const history = activeChat.history
let lastUserMsg = null
let lastUserFiles = null
for (let i = history.length - 1; i >= 0; i--) {
if (history[i].role === 'user') {
lastUserMsg = typeof history[i].content === 'string' ? history[i].content : history[i].content?.[0]?.text || ''
lastUserFiles = history[i].files || []
break
}
const end = typeof targetIndex === 'number' ? targetIndex : history.length
// Nearest user message at or before the target answer.
let userIdx = -1
for (let i = Math.min(end, history.length) - 1; i >= 0; i--) {
if (history[i].role === 'user') { userIdx = i; break }
}
if (!lastUserMsg) return
// Remove everything after and including the last user message
const newHistory = []
let foundLastUser = false
for (let i = history.length - 1; i >= 0; i--) {
if (!foundLastUser && history[i].role === 'user') {
foundLastUser = true
continue
}
if (foundLastUser) {
newHistory.unshift(history[i])
}
}
updateChatSettings(activeChat.id, { history: newHistory })
await sendMessage(lastUserMsg, lastUserFiles)
if (userIdx === -1) return
const userMsg = typeof history[userIdx].content === 'string'
? history[userIdx].content
: history[userIdx].content?.[0]?.text || ''
const userFiles = history[userIdx].files || []
// Drop the user turn and everything after it; sendMessage re-appends it.
// Thread the truncated history through explicitly: updateChatSettings only
// schedules a state update, so sendMessage's closure would otherwise read
// the stale pre-truncation history for the outbound API payload.
const baseHistory = history.slice(0, userIdx)
updateChatSettings(activeChat.id, { history: baseHistory })
await sendMessage(userMsg, userFiles, { baseHistory })
}, [activeChat, isStreaming, sendMessage, updateChatSettings])
const handleKeyDown = (e) => {
@@ -852,6 +849,11 @@ export default function Chat() {
}
}
const copyChatAsMarkdown = async (chat) => {
const ok = await copyToClipboard(serializeChatAsMarkdown(chat))
addToast(ok ? t('toasts.chatCopied') : t('toasts.copyFailed'), ok ? 'success' : 'error', ok ? 2000 : 3000)
}
const contextPercent = getContextUsagePercent()
// Recent chats for the empty state — exclude the current chat and any
@@ -892,7 +894,9 @@ export default function Chat() {
onDelete={deleteChat}
onDeleteAll={promptDeleteAll}
onRename={renameChat}
onExport={(chat) => exportChatAsMarkdown(chat)}
onExport={(chat) => downloadChatAsMarkdown(chat)}
onCopyChat={(chat) => copyChatAsMarkdown(chat)}
onDuplicate={(chat) => { if (forkChat(chat.id)) addToast(t('toasts.forked'), 'success', 2000) }}
/>
{activeChat.localaiAssistant && (
<span
@@ -1184,11 +1188,19 @@ export default function Chat() {
<button onClick={() => copyMessage(msg.content)} title={t('actions.copy')}>
<i className="fas fa-copy" />
</button>
{msg.role === 'assistant' && i === activeChat.history.length - 1 && !isStreaming && (
<button onClick={handleRegenerate} title={t('actions.regenerate')}>
{msg.role === 'assistant' && !isStreaming && (
<button onClick={() => handleRegenerate(i)} title={t('actions.regenerate')}>
<i className="fas fa-rotate" />
</button>
)}
{msg.role === 'assistant' && !isStreaming && (
<button
onClick={() => { forkChat(activeChat.id, i + 1); addToast(t('toasts.forked'), 'success', 2000) }}
title={t('actions.branch')}
>
<i className="fas fa-code-branch" />
</button>
)}
</div>
</div>
</div>

View File

@@ -146,6 +146,7 @@ export default function Manage() {
const [distributedMode, setDistributedMode] = useState(false)
const [togglingModels, setTogglingModels] = useState(new Set())
const [pinningModels, setPinningModels] = useState(new Set())
const [loadingModels, setLoadingModels] = useState(new Set())
// Expanded row state — keyed by `${tab}:${id}` so switching tabs doesn't
// collide and a single row is open at a time per tab.
const [expandedKey, setExpandedKey] = useState(null)
@@ -313,6 +314,26 @@ export default function Manage() {
})
}
// Pre-load a model (or all of a realtime pipeline's sub-models) into memory.
// The /backend/load call blocks until loading finishes, so the menu item shows
// a loading state while in flight and reports the outcome on completion.
const handleLoadModel = async (modelName) => {
setLoadingModels(prev => new Set(prev).add(modelName))
try {
await backendControlApi.load({ model: modelName })
addToast(`Loaded ${modelName}`, 'success')
setTimeout(fetchLoadedModels, 500)
} catch (err) {
addToast(`Failed to load: ${err.message}`, 'error')
} finally {
setLoadingModels(prev => {
const next = new Set(prev)
next.delete(modelName)
return next
})
}
}
const handleDeleteModel = (modelName) => {
setConfirmDialog({
title: 'Delete Model',
@@ -687,6 +708,11 @@ export default function Manage() {
label: model.disabled ? 'Enable model' : 'Disable model',
onClick: () => handleToggleModel(model.id, model.disabled),
disabled: togglingModels.has(model.id) },
{ key: 'load', icon: 'fa-bolt',
label: loadingModels.has(model.id) ? 'Loading…' : 'Load into memory',
onClick: () => handleLoadModel(model.id),
hidden: isRunning || !!model.disabled,
disabled: loadingModels.has(model.id) },
{ key: 'stop', icon: 'fa-stop', label: 'Stop model',
onClick: () => handleStopModel(model.id), hidden: !isRunning },
{ key: 'pin', icon: 'fa-thumbtack',

View File

@@ -352,6 +352,9 @@ export const realtimeApi = {
// Backend control
export const backendControlApi = {
shutdown: (body) => postJSON(API_CONFIG.endpoints.backendShutdown, body),
// Pre-load a model (or all of a realtime pipeline's sub-models) into memory.
// body: { model: "<name>" }. Inverse of shutdown.
load: (body) => postJSON(API_CONFIG.endpoints.backendLoad, body),
}
// System info

View File

@@ -106,6 +106,7 @@ export const API_CONFIG = {
video: '/video',
backendMonitor: '/backend/monitor',
backendShutdown: '/backend/shutdown',
backendLoad: '/backend/load',
modelsApply: '/models/apply',
modelsDelete: (name) => `/models/delete/${name}`,
modelsAvailable: '/models/available',

View File

@@ -207,9 +207,14 @@ func RegisterLocalAIRoutes(router *echo.Echo,
backendMonitorService := monitoring.NewBackendMonitorService(ml, cl, appConfig) // Split out for now
router.GET("/backend/monitor", localai.BackendMonitorEndpoint(backendMonitorService), adminMiddleware)
router.POST("/backend/shutdown", localai.BackendShutdownEndpoint(backendMonitorService), adminMiddleware)
// /backend/load is the inverse of /backend/shutdown: pre-load a model (or all
// of a realtime pipeline's sub-models) into memory so clients can drive
// warm-up explicitly instead of paying the cold-start cost on first use.
router.POST("/backend/load", localai.LoadModelEndpoint(cl, ml, appConfig), adminMiddleware)
// The v1/* urls are exactly the same as above - makes local e2e testing easier if they are registered.
router.GET("/v1/backend/monitor", localai.BackendMonitorEndpoint(backendMonitorService), adminMiddleware)
router.POST("/v1/backend/shutdown", localai.BackendShutdownEndpoint(backendMonitorService), adminMiddleware)
router.POST("/v1/backend/load", localai.LoadModelEndpoint(cl, ml, appConfig), adminMiddleware)
// Traces and backend logs (monitoring)
router.GET("/api/traces", localai.GetAPITracesEndpoint(), adminMiddleware)
@@ -245,6 +250,7 @@ func RegisterLocalAIRoutes(router *echo.Echo,
"metrics": "/metrics",
"backend_monitor": "/backend/monitor",
"backend_shutdown": "/backend/shutdown",
"backend_load": "/backend/load",
"system": "/system",
"version": "/version",
"traces": "/api/traces",

View File

@@ -1243,6 +1243,9 @@ func RegisterUIAPIRoutes(app *echo.Echo, cl *config.ModelConfigLoader, ml *model
Galleries: appConfig.BackendGalleries,
Context: ctx,
CancelFunc: cancelFunc,
// The React UI's "Reinstall backend" action reuses this route, so
// the op must force even when the backend is already installed.
Force: true,
}
// Store cancellation function immediately so queued operations can be cancelled
galleryService.StoreCancellation(uid, cancelFunc)

View File

@@ -11,6 +11,24 @@ type BackendMonitorRequest struct {
BasicModelRequest
}
// ModelLoadRequest asks LocalAI to pre-load a model into memory by name, so the
// first request that uses it pays no cold-start load cost. For a realtime
// pipeline model, every configured sub-model (VAD, transcription, LLM, TTS,
// sound_detection, voice_recognition) is loaded instead of the pipeline stub.
// It is the inverse of the /backend/shutdown request.
type ModelLoadRequest struct {
BasicModelRequest
}
// ModelLoadResponse reports the outcome of a /backend/load call.
type ModelLoadResponse struct {
// Loaded lists the model names actually resident in memory after the call.
// For a pipeline model these are its sub-models, not the pipeline name.
Loaded []string `json:"loaded"`
// Message is a short human-readable status ("model loaded", or an error).
Message string `json:"message"`
}
type TokenMetricsRequest struct {
BasicModelRequest
}

View File

@@ -6,10 +6,39 @@ import (
"hash/fnv"
"strings"
"sync"
"time"
"gorm.io/gorm"
)
// advisoryLockWaitBackstop bounds, server-side, how long we will wait to
// acquire a blocking advisory lock when the caller's context carries no
// deadline (e.g. a startup schema migration using context.Background()). It
// only exists so such a caller cannot hang forever behind a holder whose
// session never releases the lock; it is far longer than any legitimate
// guarded section. A var (not const) so tests can shrink it.
var advisoryLockWaitBackstop = 30 * time.Minute
// advisoryLockTimeoutMargin is added to a context's remaining budget when
// deriving the server-side lock_timeout, so the Go context's own (cleaner)
// cancellation fires first and the server bound is only ever a backstop.
const advisoryLockTimeoutMargin = 30 * time.Second
// advisoryLockWaitBudget returns the server-side lock_timeout to use for a
// blocking acquire: the caller context's remaining time plus a margin (so the
// Go context still governs), or the backstop when the context has no deadline.
// Never returns zero - "wait forever" must not be possible.
func advisoryLockWaitBudget(ctx context.Context) time.Duration {
if dl, ok := ctx.Deadline(); ok {
budget := time.Until(dl) + advisoryLockTimeoutMargin
if budget < time.Second {
budget = time.Second
}
return budget
}
return advisoryLockWaitBackstop
}
// localLocks holds one buffered channel (capacity 1) per lock key, used as an
// in-process mutex for non-PostgreSQL dialects (SQLite). A SQLite auth DB is
// effectively single-process, so serializing guarded sections within this
@@ -130,6 +159,27 @@ func WithLockCtx(ctx context.Context, db *gorm.DB, key int64, fn func() error) e
}
defer conn.Close()
// Override any deployment-wide lock_timeout on this dedicated connection.
// Operators commonly set a short global lock_timeout (on the role or
// database) to bound ordinary row-lock waits. Applied to the blocking
// pg_advisory_lock below, it aborts the wait with SQLSTATE 55P03 and turns
// LocalAI's intentional cross-replica "wait your turn, then re-check"
// coordination into a hard error for the caller (e.g. a chat request that
// just wanted to reuse a model another replica is loading).
//
// We do NOT disable it outright (lock_timeout = 0 would wait forever, which
// is unsafe for the schema-migration callers that pass context.Background()).
// Instead we set a bound derived from the caller's context: its remaining
// budget plus a margin so the Go context's cancellation wins with a clean
// error, or a finite backstop when the context has no deadline.
waitBudget := advisoryLockWaitBudget(ctx)
if _, err := conn.ExecContext(ctx,
fmt.Sprintf("SET lock_timeout = %d", waitBudget.Milliseconds())); err != nil {
return fmt.Errorf("advisorylock: setting lock_timeout: %w", err)
}
// Restore the session default before this pooled connection is reused.
defer func() { _, _ = conn.ExecContext(context.Background(), "RESET lock_timeout") }()
if _, err := conn.ExecContext(ctx, "SELECT pg_advisory_lock($1)", key); err != nil {
return fmt.Errorf("advisorylock: acquiring lock %d: %w", key, err)
}

View File

@@ -158,6 +158,87 @@ var _ = Describe("AdvisoryLock", func() {
Expect(err).To(HaveOccurred())
})
It("waits out a short server-side lock_timeout instead of failing with 55P03", func() {
const lockKey int64 = 703
// Reproduce the production deployment that triggered this: a short
// global lock_timeout set on the database. Without the fix, a waiter
// blocked on pg_advisory_lock() is aborted by the server after this
// window and surfaces SQLSTATE 55P03 ("canceling statement due to
// lock timeout") to the caller instead of waiting for its turn.
Expect(db.Exec("ALTER DATABASE testdb SET lock_timeout = '300ms'").Error).ToNot(HaveOccurred())
sqlDB, err := db.DB()
Expect(err).ToNot(HaveOccurred())
// Drop pooled connections so subsequent ones reconnect and inherit
// the new database-level lock_timeout default.
sqlDB.SetMaxIdleConns(0)
holding := make(chan struct{})
released := make(chan struct{})
go func() {
defer GinkgoRecover()
herr := WithLockCtx(context.Background(), db, lockKey, func() error {
close(holding)
// Hold well past the 300ms server lock_timeout.
time.Sleep(1 * time.Second)
return nil
})
Expect(herr).ToNot(HaveOccurred())
close(released)
}()
<-holding // ensure the holder owns the lock before we contend
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
executed := false
start := time.Now()
werr := WithLockCtx(ctx, db, lockKey, func() error {
executed = true
return nil
})
Expect(werr).ToNot(HaveOccurred(),
"waiter should wait out the in-progress hold, not fail with lock_timeout (55P03)")
Expect(executed).To(BeTrue())
Expect(time.Since(start)).To(BeNumerically(">=", 400*time.Millisecond),
"waiter should have actually waited for the holder to release")
<-released
})
It("bounds a deadline-less waiter with the backstop instead of waiting forever", func() {
const lockKey int64 = 704
// A caller with no context deadline (e.g. startup schema migration
// passing context.Background()) must not hang forever if the holder
// never releases. Shrink the backstop so the test is fast.
origBackstop := advisoryLockWaitBackstop
advisoryLockWaitBackstop = 500 * time.Millisecond
DeferCleanup(func() { advisoryLockWaitBackstop = origBackstop })
holding := make(chan struct{})
release := make(chan struct{})
go func() {
defer GinkgoRecover()
_ = WithLockCtx(context.Background(), db, lockKey, func() error {
close(holding)
<-release // hold until the test releases us
return nil
})
}()
defer close(release)
<-holding
start := time.Now()
err := WithLockCtx(context.Background(), db, lockKey, func() error {
Fail("waiter should not have acquired the still-held lock")
return nil
})
Expect(err).To(HaveOccurred(), "deadline-less waiter should give up at the backstop, not hang")
Expect(time.Since(start)).To(BeNumerically("<", 5*time.Second),
"backstop must cap the wait well under the test timeout")
})
It("serializes concurrent WithLockCtx on same key", func() {
const lockKey int64 = 702

View File

@@ -0,0 +1,114 @@
package galleryop_test
import (
"context"
"os"
"path/filepath"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/gallery"
"github.com/mudler/LocalAI/core/services/galleryop"
"github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/LocalAI/pkg/system"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"gopkg.in/yaml.v3"
)
// The install op must be idempotent unless Force is set: API clients call
// POST /backends/apply on every boot to make sure the backend exists, and an
// unconditional force here re-downloads the whole backend artifact each time.
// Reinstall is an explicit, opted-in action.
var _ = Describe("LocalBackendManager force semantics", func() {
var (
backendsDir string
srcDir string
mgr *galleryop.LocalBackendManager
systemState *system.SystemState
ml *model.ModelLoader
)
const installedRunSh = "#!/bin/sh\necho installed\n"
const galleryRunSh = "#!/bin/sh\necho from-gallery\n"
installedRunShPath := func() string {
return filepath.Join(backendsDir, "test-backend", "run.sh")
}
BeforeEach(func() {
var err error
backendsDir, err = os.MkdirTemp("", "force-backends-*")
Expect(err).NotTo(HaveOccurred())
srcDir, err = os.MkdirTemp("", "force-src-*")
Expect(err).NotTo(HaveOccurred())
// The gallery serves test-backend from a plain directory (offline).
// The gallery yaml itself must live under the backends path: file://
// galleries outside the trusted root are rejected by the downloader.
Expect(os.WriteFile(filepath.Join(srcDir, "run.sh"), []byte(galleryRunSh), 0o755)).To(Succeed())
entries := []map[string]any{{"name": "test-backend", "uri": srcDir}}
data, err := yaml.Marshal(entries)
Expect(err).NotTo(HaveOccurred())
galleryYAML := filepath.Join(backendsDir, "gallery.yaml")
Expect(os.WriteFile(galleryYAML, data, 0o644)).To(Succeed())
// test-backend is already installed, with content that differs from
// the gallery's so a reinstall is observable.
Expect(os.MkdirAll(filepath.Join(backendsDir, "test-backend"), 0o755)).To(Succeed())
Expect(os.WriteFile(installedRunShPath(), []byte(installedRunSh), 0o755)).To(Succeed())
systemState, err = system.GetSystemState(system.WithBackendPath(backendsDir))
Expect(err).NotTo(HaveOccurred())
appConfig := &config.ApplicationConfig{
SystemState: systemState,
BackendGalleries: []config.Gallery{{Name: "test", URL: "file://" + galleryYAML}},
}
ml = model.NewModelLoader(systemState)
mgr = galleryop.NewLocalBackendManager(appConfig, ml)
})
AfterEach(func() {
Expect(os.RemoveAll(backendsDir)).To(Succeed())
Expect(os.RemoveAll(srcDir)).To(Succeed())
})
It("skips an already-installed backend when Force is not set", func() {
op := &galleryop.ManagementOp[gallery.GalleryBackend, any]{
ID: "op-1",
GalleryElementName: "test-backend",
}
Expect(mgr.InstallBackend(context.Background(), op, nil)).To(Succeed())
content, err := os.ReadFile(installedRunShPath())
Expect(err).NotTo(HaveOccurred())
Expect(string(content)).To(Equal(installedRunSh), "install without Force must not overwrite an installed backend")
})
It("reinstalls an already-installed backend when Force is set", func() {
op := &galleryop.ManagementOp[gallery.GalleryBackend, any]{
ID: "op-2",
GalleryElementName: "test-backend",
Force: true,
}
Expect(mgr.InstallBackend(context.Background(), op, nil)).To(Succeed())
content, err := os.ReadFile(installedRunShPath())
Expect(err).NotTo(HaveOccurred())
Expect(string(content)).To(Equal(galleryRunSh), "install with Force must overwrite the installed backend")
})
// The LOCALAI_EXTERNAL_BACKENDS boot loop goes through
// InstallExternalBackend's gallery-name path on EVERY startup; it must not
// force, or each boot re-downloads every listed backend.
It("skips an already-installed backend on the non-forced external gallery-name path", func() {
err := galleryop.InstallExternalBackend(context.Background(),
[]config.Gallery{{Name: "test", URL: "file://" + filepath.Join(backendsDir, "gallery.yaml")}},
systemState, ml, nil, "test-backend", "", "", false, false)
Expect(err).NotTo(HaveOccurred())
content, err := os.ReadFile(installedRunShPath())
Expect(err).NotTo(HaveOccurred())
Expect(string(content)).To(Equal(installedRunSh), "non-forced external install must not overwrite an installed backend")
})
})

View File

@@ -144,7 +144,12 @@ func (g *GalleryService) backendHandler(op *ManagementOp[gallery.GalleryBackend,
// InstallExternalBackend installs a backend from an external source (OCI image, URL, or path).
// This method contains the logic to detect the input type and call the appropriate installation function.
// It can be used by both CLI and Web UI for installing backends from external sources.
func InstallExternalBackend(ctx context.Context, galleries []config.Gallery, systemState *system.SystemState, modelLoader *model.ModelLoader, downloadStatus func(string, string, string, float64), backend, name, alias string, requireIntegrity bool) error {
//
// force applies only to the gallery-name fallback: a URI install (dir/OCI/file)
// always writes, but a bare gallery name is an "ensure installed" — the
// LOCALAI_EXTERNAL_BACKENDS boot loop runs it on every start and must not
// re-download an installed, runnable backend.
func InstallExternalBackend(ctx context.Context, galleries []config.Gallery, systemState *system.SystemState, modelLoader *model.ModelLoader, downloadStatus func(string, string, string, float64), backend, name, alias string, force, requireIntegrity bool) error {
uri := downloader.URI(backend)
switch {
case uri.LooksLikeDir():
@@ -202,7 +207,7 @@ func InstallExternalBackend(ctx context.Context, galleries []config.Gallery, sys
if name != "" || alias != "" {
return fmt.Errorf("specifying a name or alias is not supported for gallery backends")
}
err := gallery.InstallBackendFromGallery(ctx, galleries, systemState, modelLoader, backend, downloadStatus, true, requireIntegrity)
err := gallery.InstallBackendFromGallery(ctx, galleries, systemState, modelLoader, backend, downloadStatus, force, requireIntegrity)
if err != nil {
return fmt.Errorf("error installing backend %s: %w", backend, err)
}

View File

@@ -70,6 +70,7 @@ var _ = Describe("InstallExternalBackend", func() {
"test-backend", // gallery name
"custom-name", // name should not be allowed
"",
false, // force
false,
)
Expect(err).To(HaveOccurred())
@@ -86,6 +87,7 @@ var _ = Describe("InstallExternalBackend", func() {
"non-existent-backend",
"",
"",
false, // force
false,
)
Expect(err).To(HaveOccurred())
@@ -103,6 +105,7 @@ var _ = Describe("InstallExternalBackend", func() {
"oci://quay.io/mudler/tests:localai-backend-test",
"", // name is required for OCI images
"",
false, // force
false,
)
Expect(err).To(HaveOccurred())
@@ -136,6 +139,7 @@ var _ = Describe("InstallExternalBackend", func() {
testBackendPath,
"", // name should be inferred as "source-backend"
"",
false, // force
false,
)
// The function should at least attempt to install with the inferred name
@@ -155,6 +159,7 @@ var _ = Describe("InstallExternalBackend", func() {
testBackendPath,
"custom-backend-name",
"",
false, // force
false,
)
// The function should use the provided name
@@ -173,6 +178,7 @@ var _ = Describe("InstallExternalBackend", func() {
testBackendPath,
"custom-backend-name",
"custom-alias",
false, // force
false,
)
// The function should accept alias for directory paths

View File

@@ -110,10 +110,13 @@ func (b *LocalBackendManager) CheckUpgrades(ctx context.Context) (map[string]gal
func (b *LocalBackendManager) InstallBackend(ctx context.Context, op *ManagementOp[gallery.GalleryBackend, any], progressCb ProgressCallback) error {
if op.ExternalURI != "" {
return InstallExternalBackend(ctx, b.backendGalleries, b.systemState, b.modelLoader,
progressCb, op.ExternalURI, op.ExternalName, op.ExternalAlias, b.requireBackendIntegrity)
progressCb, op.ExternalURI, op.ExternalName, op.ExternalAlias, op.Force, b.requireBackendIntegrity)
}
// op.Force distinguishes an explicit reinstall from an idempotent
// "make sure it's installed" op; the latter must not re-download an
// already-runnable backend (supervisors apply on every boot).
return gallery.InstallBackendFromGallery(ctx, b.backendGalleries, b.systemState,
b.modelLoader, op.GalleryElementName, progressCb, true, b.requireBackendIntegrity)
b.modelLoader, op.GalleryElementName, progressCb, op.Force, b.requireBackendIntegrity)
}
func (b *LocalBackendManager) IsDistributed() bool { return false }

View File

@@ -45,6 +45,13 @@ type ManagementOp[T any, E any] struct {
// Upgrade is true if this is an upgrade operation (not a fresh install)
Upgrade bool
// Force reinstalls a backend even when it is already installed and
// runnable. Without it a backend install op is idempotent — API clients
// that ensure a backend exists on every boot must not trigger a full
// artifact re-download each time. The UI's explicit "Reinstall backend"
// action sets it.
Force bool
}
type OpStatus struct {

View File

@@ -68,6 +68,13 @@ type SmartRouterOptions struct {
// the absolute model paths untouched so the worker loads them directly from
// the shared volume (#10556). See config.DistributedConfig.SharedModels.
SharedModels bool
// ModelLoadCeiling is the hard upper bound on how long a single cold-load
// attempt (node selection -> backend install -> file staging -> LoadModel)
// may run while holding the per-model advisory lock. It backstops every
// sub-step's own timeout so a wedged worker can never pin the lock - and
// every other replica's request for that model - indefinitely. Zero selects
// defaultModelLoadCeiling.
ModelLoadCeiling time.Duration
}
// SmartRouter routes inference requests to the best available backend node.
@@ -101,8 +108,18 @@ type SmartRouter struct {
// sharedModels skips file staging when all nodes mount the same models
// directory at the same path (see SmartRouterOptions.SharedModels).
sharedModels bool
// modelLoadCeiling bounds how long a cold load may hold the per-model
// advisory lock (see SmartRouterOptions.ModelLoadCeiling).
modelLoadCeiling time.Duration
}
// defaultModelLoadCeiling is the fallback hold ceiling for a cold model load.
// It must comfortably exceed the slowest legitimate load - a multi-GB backend
// install (DefaultBackendInstallTimeout, 15m) plus staging and the remote
// LoadModel (5m) - so it never cuts a real load short; it only ever fires when
// a step is genuinely wedged (e.g. a worker that died mid-install).
const defaultModelLoadCeiling = 25 * time.Minute
// probeCacheTTL is how long a successful gRPC HealthCheck on a backend is
// trusted before the next request re-probes. Matches healthCheckTTL in
// pkg/model/model.go so the single-process and distributed paths share a
@@ -117,6 +134,10 @@ func NewSmartRouter(registry ModelRouter, opts SmartRouterOptions) *SmartRouter
if factory == nil {
factory = &tokenClientFactory{token: opts.AuthToken}
}
ceiling := opts.ModelLoadCeiling
if ceiling <= 0 {
ceiling = defaultModelLoadCeiling
}
return &SmartRouter{
registry: registry,
unloader: opts.Unloader,
@@ -131,6 +152,7 @@ func NewSmartRouter(registry ModelRouter, opts SmartRouterOptions) *SmartRouter
prefixConfig: opts.PrefixConfig,
pressure: opts.Pressure,
sharedModels: opts.SharedModels,
modelLoadCeiling: ceiling,
}
}
@@ -383,11 +405,19 @@ func (r *SmartRouter) Route(ctx context.Context, modelID, modelName, backendType
// the request context. If staging were bound to it, the multi-GB upload
// aborts with "context canceled" mid-transfer and large models can never
// finish staging (the model-load outage). WithoutCancel keeps the request's
// values (prefix chain, etc.) but drops its cancellation/deadline. Each
// long step still has its own bound (the file stager's resume budget,
// LoadModel's 5m timeout), and the per-model advisory lock below de-dupes
// concurrent loaders across replicas.
loadCtx := context.WithoutCancel(ctx)
// values (prefix chain, etc.) but drops its cancellation/deadline.
//
// Detaching from the caller is necessary, but it must not be unbounded: the
// load runs while holding the per-model advisory lock, and a worker that
// dies mid-install (its backend.install never replies) would otherwise pin
// that lock (and every other replica's request for the same model) until
// the NATS install deadline alone expires. Re-impose a single hard ceiling
// over the whole sequence so the lock is always released in bounded time,
// even if a sub-step wedges. Each long step still has its own (tighter)
// bound; this only backstops them. The per-model advisory lock below
// de-dupes concurrent loaders across replicas.
loadCtx, cancelLoad := context.WithTimeout(context.WithoutCancel(ctx), r.modelLoadCeiling)
defer cancelLoad()
loadModel := func(ctx context.Context) (*RouteResult, error) {
// Re-check after acquiring lock — another request may have loaded it
node, nm, err := r.registry.FindAndLockNodeWithModel(ctx, trackingKey, candidateNodeIDs, pref)
@@ -916,7 +946,14 @@ func (r *SmartRouter) installBackendOnNode(ctx context.Context, node *BackendNod
}
key := fmt.Sprintf("%s|%s|%s|%d", node.ID, backendType, modelID, replicaIndex)
v, err, _ := r.installFlight.Do(key, func() (any, error) {
// DoChan rather than Do so this wait honors ctx cancellation. InstallBackend
// blocks for its full NATS deadline (15m by default) when a worker accepts
// the request but never replies (e.g. it died mid-install). Without ctx
// awareness the caller (holding the per-model advisory lock) would sit there
// the whole time; here a cancelled ctx (typically the model-load ceiling)
// frees the caller promptly. The shared install keeps running in the
// background and still coalesces other callers via singleflight.
resCh := r.installFlight.DoChan(key, func() (any, error) {
reply, err := r.unloader.InstallBackend(node.ID, backendType, modelID, r.galleriesJSON, "", "", "", replicaIndex, "", nil)
if err != nil {
return "", err
@@ -931,10 +968,15 @@ func (r *SmartRouter) installBackendOnNode(ctx context.Context, node *BackendNod
}
return addr, nil
})
if err != nil {
return "", err
select {
case <-ctx.Done():
return "", ctx.Err()
case res := <-resCh:
if res.Err != nil {
return "", res.Err
}
return res.Val.(string), nil
}
return v.(string), nil
}
func (r *SmartRouter) buildClientForAddr(node *BackendNode, addr string, parallel bool) grpc.Backend {

View File

@@ -493,6 +493,44 @@ var _ = Describe("SmartRouter", func() {
Expect(result.Node.ID).To(Equal("n3"))
})
})
Context("worker wedges mid-install (dead node holding the lock)", func() {
It("aborts the load at the ModelLoadCeiling instead of blocking forever", func() {
// Simulate the production incident: the chosen worker accepts the
// backend.install but never replies (it died), so InstallBackend
// would otherwise block for its full NATS deadline (15m by
// default) while pinning the per-model advisory lock. Route must
// give up at the ceiling so the lock is released promptly.
reg.findAndLockErr = errors.New("not found")
reg.findIdleNode = &BackendNode{ID: "n4", Name: "dead-node", Address: "10.0.0.4:50051"}
block := make(chan struct{})
defer close(block) // let the background install goroutine drain at test end
unloader.installHook = func() { <-block }
router := NewSmartRouter(reg, SmartRouterOptions{
Unloader: unloader,
ClientFactory: factory,
ModelLoadCeiling: 200 * time.Millisecond,
})
done := make(chan error, 1)
start := time.Now()
go func() {
defer GinkgoRecover()
_, err := router.Route(context.Background(), "wedged-model",
"models/wedged.gguf", "llama-cpp",
&pb.ModelOptions{Model: "models/wedged.gguf"}, false)
done <- err
}()
var routeErr error
Eventually(done, 5*time.Second).Should(Receive(&routeErr),
"Route must not block on a wedged install past the ceiling")
Expect(routeErr).To(HaveOccurred())
Expect(time.Since(start)).To(BeNumerically("<", 5*time.Second))
})
})
})
Describe("scheduleNewModel (mock-based, via Route)", func() {

View File

@@ -0,0 +1,48 @@
package pii
import (
"context"
"sync"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/metric"
)
// Prometheus counter for PII events. The EventStore ring buffer is
// capacity-bound and meant for recent-audit browsing; operators also want
// a monotonic, scrape-friendly signal ("how many detections/blocks per
// hour, did the filter stop firing after a deploy"). Record() is the
// single choke point every producer already goes through (request
// middleware, response scrubbing, MITM proxy connects/intercepts), so one
// counter here covers all paths without touching the producers.
//
// Initialised lazily on first Record so the package works no matter when
// (or whether) the Prometheus-backed global MeterProvider is installed —
// same pattern as core/services/routing/billing.
var (
metricsOnce sync.Once
eventsCounter metric.Int64Counter
)
func recordEventMetric(e PIIEvent) {
metricsOnce.Do(func() {
meter := otel.Meter("github.com/mudler/LocalAI")
c, err := meter.Int64Counter(
"localai_pii_events_total",
metric.WithDescription("PII/audit events recorded, labeled by kind, origin, action and direction"),
)
if err == nil {
eventsCounter = c
}
})
if eventsCounter == nil {
return
}
eventsCounter.Add(context.Background(), 1, metric.WithAttributes(
attribute.String("kind", string(e.Kind)),
attribute.String("origin", string(e.Origin)),
attribute.String("action", string(e.Action)),
attribute.String("direction", string(e.Direction)),
))
}

View File

@@ -58,6 +58,7 @@ type memoryEventStore struct {
}
func (s *memoryEventStore) Record(_ context.Context, e PIIEvent) error {
recordEventMetric(e)
s.mu.Lock()
defer s.mu.Unlock()
s.ring[s.cursor] = e

View File

@@ -134,7 +134,7 @@ func (s *backendSupervisor) installBackend(req messaging.BackendInstallRequest,
if req.URI != "" {
xlog.Info("Installing backend from external URI", "backend", req.Backend, "uri", req.URI, "force", force)
if err := galleryop.InstallExternalBackend(
context.Background(), galleries, s.systemState, s.ml, downloadCb, req.URI, req.Name, req.Alias, s.cfg.RequireBackendIntegrity,
context.Background(), galleries, s.systemState, s.ml, downloadCb, req.URI, req.Name, req.Alias, force, s.cfg.RequireBackendIntegrity,
); err != nil {
return "", fmt.Errorf("installing backend from gallery: %w", err)
}
@@ -201,7 +201,7 @@ func (s *backendSupervisor) upgradeBackend(req messaging.BackendUpgradeRequest)
if req.URI != "" {
xlog.Info("Upgrading backend from external URI", "backend", req.Backend, "uri", req.URI)
if err := galleryop.InstallExternalBackend(
context.Background(), galleries, s.systemState, s.ml, downloadCb, req.URI, req.Name, req.Alias, s.cfg.RequireBackendIntegrity,
context.Background(), galleries, s.systemState, s.ml, downloadCb, req.URI, req.Name, req.Alias, true, s.cfg.RequireBackendIntegrity,
); err != nil {
return fmt.Errorf("upgrading backend from external URI: %w", err)
}

View File

@@ -14,6 +14,16 @@ import (
// MaxSnippetSeconds is the maximum number of seconds of audio captured per trace.
const MaxSnippetSeconds = 30
// silenceFloorDBFS is the dBFS value reported for digital silence (RMS or peak
// of zero). The true level is -∞ dBFS; reporting a finite floor keeps the
// metric present and meaningful in the Traces UI (a scrubbed nil would read as
// "missing" rather than "silent"). -120 dBFS sits well below 16-bit PCM's
// ~-90 dBFS least-significant-bit floor, so it reads unambiguously as
// "effectively silent". JSON-marshal safety for any non-finite float that does
// reach a trace is owned centrally by RecordBackendTrace's sanitizer — this
// floor is about presentation, not transport.
const silenceFloorDBFS = -120.0
// AudioSnippet captures the first MaxSnippetSeconds of a WAV file and computes
// quality metrics. The result is a map suitable for merging into a BackendTrace
// Data field. maxBytes caps the embedded base64 waveform so a single TTS or
@@ -63,7 +73,7 @@ func AudioSnippetFromPCM(pcm []byte, sampleRate, totalPCMBytes, maxBytes int) ma
snippetDuration := float64(len(samples)) / float64(sampleRate)
rms := sound.CalculateRMS16(samples)
rmsDBFS := -math.Inf(1)
rmsDBFS := silenceFloorDBFS
if rms > 0 {
rmsDBFS = 20 * math.Log10(rms/32768.0)
}
@@ -78,7 +88,7 @@ func AudioSnippetFromPCM(pcm []byte, sampleRate, totalPCMBytes, maxBytes int) ma
}
dcSum += int64(s)
}
peakDBFS := -math.Inf(1)
peakDBFS := silenceFloorDBFS
if peak > 0 {
peakDBFS = 20 * math.Log10(float64(peak)/32768.0)
}

View File

@@ -1,6 +1,9 @@
package trace_test
import (
"encoding/json"
"math"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
@@ -47,3 +50,32 @@ var _ = Describe("AudioSnippetFromPCM byte cap", func() {
Expect(out).To(HaveKey("audio_wav_base64"))
})
})
// Silent audio (RMS/peak of zero) has a true level of -∞ dBFS, but emitting
// -Inf made the whole /api/backend-traces response fail to JSON-marshal and
// blanked the Traces UI. The metrics must instead be finite and serializable.
var _ = Describe("AudioSnippetFromPCM silent audio dBFS", func() {
pcm := makePCM(snippetSeconds, snippetSampleRate) // all zeros == digital silence
totalPCM := len(pcm)
It("reports finite dBFS for silence instead of -Inf", func() {
out := trace.AudioSnippetFromPCM(pcm, snippetSampleRate, totalPCM, 0)
rms, ok := out["audio_rms_dbfs"].(float64)
Expect(ok).To(BeTrue())
Expect(math.IsInf(rms, 0)).To(BeFalse(), "silent RMS must not be ±Inf")
Expect(math.IsNaN(rms)).To(BeFalse())
peak, ok := out["audio_peak_dbfs"].(float64)
Expect(ok).To(BeTrue())
Expect(math.IsInf(peak, 0)).To(BeFalse(), "silent peak must not be ±Inf")
Expect(math.IsNaN(peak)).To(BeFalse())
})
It("produces a snippet that round-trips through encoding/json", func() {
out := trace.AudioSnippetFromPCM(pcm, snippetSampleRate, totalPCM, 0)
_, err := json.Marshal(out)
Expect(err).ToNot(HaveOccurred(), "silent-audio metrics must be JSON-marshalable")
})
})

View File

@@ -3,6 +3,8 @@ package trace
import (
"encoding/json"
"fmt"
"maps"
"math"
"slices"
"sync"
"time"
@@ -116,8 +118,13 @@ func RecordBackendTrace(t BackendTrace) {
backendMu.Lock()
maxBody := backendMaxBodyBytes
backendMu.Unlock()
if t.Data != nil && maxBody > 0 {
t.Data = capDataStrings(t.Data, maxBody)
// Always walk Data, even with no body cap configured: besides capping
// oversized strings (maxBody > 0), the walk replaces non-finite floats
// (Inf/NaN) that encoding/json cannot marshal. A single such value — e.g. a
// -Inf dBFS audio metric from a silent clip — would otherwise fail the whole
// /api/backend-traces response and blank the Traces UI.
if t.Data != nil {
t.Data = sanitizeData(t.Data, maxBody)
}
select {
case backendLogChan <- &t:
@@ -126,32 +133,90 @@ func RecordBackendTrace(t BackendTrace) {
}
}
// capDataStrings walks a trace Data map and replaces any string value (at any
// depth) that exceeds maxBytes with a fixed-size marker that names the
// original byte count. The replacement is intentionally short and not valid
// base64/JSON: the goal is to flag "this was dropped" cheaply, not to keep a
// partial value that the UI might try to render. Non-string scalars and
// non-map containers pass through untouched so structural fields like
// total_deltas or audio_sample_rate remain useful.
func capDataStrings(data map[string]any, maxBytes int) map[string]any {
out := make(map[string]any, len(data))
for k, v := range data {
out[k] = capValue(v, maxBytes)
}
// sanitizeData walks a trace Data map (recursing into nested maps and slices)
// and makes every value safe for the /api/backend-traces JSON response:
//
// - When maxBytes > 0, any string longer than maxBytes is replaced with a
// fixed-size marker that names the original byte count. The replacement is
// intentionally short and not valid base64/JSON: it flags "this was dropped"
// cheaply rather than keeping a partial value the UI might try to render.
// - Non-finite floats (Inf/NaN) are replaced with nil regardless of maxBytes,
// because encoding/json refuses to marshal them and one bad value would fail
// the entire response.
//
// Other scalars (ints, bools, finite floats) pass through untouched so
// structural fields like total_deltas or audio_sample_rate remain useful.
//
// The walk is copy-on-write: it runs on every RecordBackendTrace call, and in
// the common case nothing needs rewriting, so containers are only re-allocated
// on the paths that actually changed and untouched values keep their original
// interface boxes instead of paying a per-value re-boxing allocation.
func sanitizeData(data map[string]any, maxBytes int) map[string]any {
out, _ := sanitizeMap(data, maxBytes)
return out
}
func capValue(v any, maxBytes int) any {
func sanitizeMap(m map[string]any, maxBytes int) (map[string]any, bool) {
var out map[string]any
for k, v := range m {
nv, changed := sanitizeValue(v, maxBytes)
if changed && out == nil {
// First change: fork the map. Entries already visited were
// unchanged, so a full copy then overwriting as we go is exact.
out = make(map[string]any, len(m))
maps.Copy(out, m)
}
if out != nil {
out[k] = nv
}
}
if out == nil {
return m, false
}
return out, true
}
func sanitizeSlice(s []any, maxBytes int) ([]any, bool) {
var out []any
for i, v := range s {
nv, changed := sanitizeValue(v, maxBytes)
if changed && out == nil {
out = make([]any, len(s))
copy(out, s)
}
if out != nil {
out[i] = nv
}
}
if out == nil {
return s, false
}
return out, true
}
func sanitizeValue(v any, maxBytes int) (any, bool) {
switch val := v.(type) {
case string:
if len(val) > maxBytes {
return fmt.Sprintf("<truncated: %d bytes>", len(val))
if maxBytes > 0 && len(val) > maxBytes {
return fmt.Sprintf("<truncated: %d bytes>", len(val)), true
}
return val
return v, false
case float64:
if math.IsInf(val, 0) || math.IsNaN(val) {
return nil, true
}
return v, false
case float32:
if f := float64(val); math.IsInf(f, 0) || math.IsNaN(f) {
return nil, true
}
return v, false
case map[string]any:
return capDataStrings(val, maxBytes)
return sanitizeMap(val, maxBytes)
case []any:
return sanitizeSlice(val, maxBytes)
default:
return v
return v, false
}
}

View File

@@ -0,0 +1,80 @@
package trace_test
import (
"encoding/json"
"math"
"time"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"github.com/mudler/LocalAI/core/trace"
)
// encoding/json cannot marshal ±Inf or NaN. The /api/backend-traces endpoint
// serializes the whole buffer with one json call, so a single non-finite float
// in any trace's Data map (e.g. a -Inf dBFS audio metric from a silent clip)
// would fail the entire response and blank the Traces UI. RecordBackendTrace
// must scrub those values regardless of whether a body cap is configured.
var _ = Describe("RecordBackendTrace non-finite float sanitization", func() {
BeforeEach(func() {
// maxBodyBytes 0 == no body cap: float sanitization must still run.
trace.InitBackendTracingIfEnabled(64, 0)
trace.ClearBackendTraces()
})
It("replaces ±Inf and NaN with nil so the response stays JSON-marshalable", func() {
trace.RecordBackendTrace(trace.BackendTrace{
Timestamp: time.Now(),
Type: trace.BackendTraceTranscription,
ModelName: "m",
Data: map[string]any{
"audio_rms_dbfs": math.Inf(-1),
"audio_peak_dbfs": math.Inf(1),
"weird": math.NaN(),
"audio_duration_s": 1.5, // finite siblings must survive
},
})
Eventually(trace.GetBackendTraces).Should(HaveLen(1))
got := trace.GetBackendTraces()[0]
Expect(got.Data["audio_rms_dbfs"]).To(BeNil())
Expect(got.Data["audio_peak_dbfs"]).To(BeNil())
Expect(got.Data["weird"]).To(BeNil())
Expect(got.Data["audio_duration_s"]).To(Equal(1.5), "finite floats must pass through untouched")
_, err := json.Marshal(trace.GetBackendTraces())
Expect(err).ToNot(HaveOccurred(), "the whole trace buffer must marshal even with non-finite inputs")
})
It("scrubs non-finite floats nested in maps and slices", func() {
trace.RecordBackendTrace(trace.BackendTrace{
Timestamp: time.Now(),
Type: trace.BackendTraceLLM,
ModelName: "m",
Data: map[string]any{
"nested": map[string]any{
"logprob": math.Inf(-1),
"ok": 0.25,
},
"scores": []any{1.0, math.Inf(1), math.NaN()},
},
})
Eventually(trace.GetBackendTraces).Should(HaveLen(1))
got := trace.GetBackendTraces()[0]
nested := got.Data["nested"].(map[string]any)
Expect(nested["logprob"]).To(BeNil())
Expect(nested["ok"]).To(Equal(0.25))
scores := got.Data["scores"].([]any)
Expect(scores[0]).To(Equal(1.0))
Expect(scores[1]).To(BeNil())
Expect(scores[2]).To(BeNil())
_, err := json.Marshal(trace.GetBackendTraces())
Expect(err).ToNot(HaveOccurred())
})
})

View File

@@ -381,6 +381,8 @@ curl -X POST http://localhost:8080/backend/shutdown \
To stop all models, you'll need to call the endpoint for each loaded model individually, or use the web UI to stop all models at once.
Conversely, you can pre-load a model into memory ahead of its first request with `POST /backend/load` (the inverse of shutdown) — see [Backend Monitor]({{%relref "features/backend-monitor" %}}).
### Best Practices
1. **Monitor VRAM usage**: Use `nvidia-smi` (for NVIDIA GPUs) or similar tools to monitor actual VRAM usage

View File

@@ -166,7 +166,7 @@ When authentication is enabled, the following endpoints require admin role:
- `GET /api/backend-traces`, `POST /api/backend-traces/clear`
- `GET /api/backend-logs/*`, `POST /api/backend-logs/*/clear`
- `GET /api/resources`, `GET /api/settings`, `POST /api/settings`
- `GET /system`, `GET /backend/monitor`, `POST /backend/shutdown`
- `GET /system`, `GET /backend/monitor`, `POST /backend/shutdown`, `POST /backend/load`
**P2P:**
- `GET /api/p2p/*`

View File

@@ -5,7 +5,9 @@ weight = 20
url = "/features/backend-monitor/"
+++
LocalAI provides endpoints to monitor and manage running backends. The `/backend/monitor` endpoint reports the status and resource usage of loaded models, and `/backend/shutdown` allows stopping a model's backend process.
LocalAI provides endpoints to monitor and manage running backends. The `/backend/monitor` endpoint reports the status and resource usage of loaded models, `/backend/load` pre-loads a model into memory, and `/backend/shutdown` allows stopping a model's backend process.
All three are admin-only.
## Monitor API
@@ -62,6 +64,42 @@ curl "http://localhost:8080/backend/monitor?model=my-model"
}
```
## Load API
Pre-loads a model into memory ahead of its first request, so that request pays no cold-start load cost. It is the inverse of the Shutdown API and works for any model, not just realtime pipelines.
- **Method:** `POST`
- **Endpoints:** `/backend/load`, `/v1/backend/load`
### Request
| Parameter | Type | Required | Description |
|-----------|----------|----------|------------------------------|
| `model` | `string` | Yes | Name of the model to load |
### Behavior
- For a regular model, its own backend is loaded.
- For a **realtime pipeline** model (a config with a `pipeline:` block), every configured sub-model (VAD, transcription, LLM, TTS, sound_detection, voice_recognition) is loaded concurrently instead of the pipeline stub, which has no backend of its own.
The call blocks until loading finishes and reports which model names became resident, so partial failures are visible.
### Usage
```bash
curl -X POST http://localhost:8080/backend/load \
-H "Content-Type: application/json" \
-d '{"model": "my-model"}'
```
### Example response
```json
{ "loaded": ["my-model"], "message": "model loaded" }
```
On failure the call returns `500` with `loaded` listing whichever sub-models did load and `message` naming the failures.
## Shutdown API
- **Method:** `POST`

View File

@@ -56,6 +56,39 @@ pipeline:
All streaming flags are off by default, so existing pipelines are unaffected.
### Model warm-up (cold start)
Without warm-up the pipeline's models are loaded into memory only on first use *within* a session: the VAD on the first audio chunk, transcription at the first end-of-speech, the LLM on the first reply, and TTS on the first spoken output. On a cold session this staggers a load delay across those first few interactions — and a model that fails to load (missing weights, wrong backend, out of memory) only fails part-way through the first turn.
To avoid that, LocalAI **warms the pipeline by default**: it loads the VAD, transcription, LLM and TTS backends into memory *before* the session is announced, and the session start **blocks until they are all ready**. The loads run concurrently, so the wait is the slowest single model, not the sum. This means:
- The first turn pays no cold-start cost — every backend is already resident.
- **Model-load errors surface at session start.** If any stage fails to load, the session is not started and the client receives a `model_load_error` instead of `session.created`, so a broken pipeline fails fast and visibly rather than mid-call.
Set `disable_warmup: true` to restore the lazy "load on first use" behavior — session start no longer waits on loading and load errors surface on the first turn instead. Useful if you want idle sessions to avoid holding model memory they may never use:
```yaml
name: gpt-realtime
pipeline:
vad: silero-vad-ggml
transcription: whisper-large-turbo
llm: qwen3-4b
tts: tts-1
disable_warmup: true # lazily load each model on first use instead of at session start
```
#### Pre-loading a pipeline on demand
Warm-up only fires when a realtime session opens. To load a pipeline into memory ahead of time — e.g. to warm it right after boot, or when running with `disable_warmup: true` — POST the model name to the admin-only `/backend/load` endpoint. For a pipeline model it loads every configured sub-model (VAD, transcription, LLM, TTS, sound_detection, voice_recognition) concurrently:
```bash
curl -X POST http://localhost:8080/backend/load \
-H "Content-Type: application/json" \
-d '{"model": "gpt-realtime"}'
```
The endpoint is not realtime-specific — it pre-loads any model. See [Backend Monitor]({{%relref "features/backend-monitor" %}}) for the full request/response reference (it is the inverse of `/backend/shutdown`).
### Turn detection
Turn detection decides when the user has finished speaking and the pipeline should respond. Two modes are supported, matching the OpenAI session schema:

View File

@@ -507,7 +507,7 @@ The `llama.cpp` backend supports additional configuration options that can be sp
| `fit_params_min_ctx` or `fit_ctx` | integer | Minimum context size that can be set by fit_params. Default: `4096`. | `fit_ctx:2048` |
| `n_cache_reuse` or `cache_reuse` | integer | Minimum chunk size to attempt reusing from the cache via KV shifting. Default: `0` (disabled). | `cache_reuse:256` |
| `slot_prompt_similarity` or `sps` | float | How much the prompt of a request must match the prompt of a slot to use that slot. Default: `0.1`. Set to `0` to disable. | `sps:0.5` |
| `swa_full` | boolean | Use full-size SWA (Sliding Window Attention) cache. Upstream default is `false` (a memory-light reduced cache), but that reduced cache cannot reuse a prompt prefix across requests, which defeats `cache_reuse` for SWA models (Gemma 2/3, Cohere2, Llama 4, ...). LocalAI therefore **auto-enables `swa_full:true` for GGUF models detected as SWA** so the cross-request prefix cache works; it is left off for dense models. The tradeoff is memory: the full SWA cache scales with `context_size`. Set `swa_full:false` explicitly to opt back out (e.g. to save memory at a large context). | `swa_full:true` |
| `swa_full` | boolean | Use full-size SWA (Sliding Window Attention) cache. Default: `false`. | `swa_full:true` |
| `cont_batching` or `continuous_batching` | boolean | Enable continuous batching for handling multiple sequences. Default: `true`. | `cont_batching:true` |
| `check_tensors` | boolean | Validate tensor data for invalid values during model loading. Default: `false`. | `check_tensors:true` |
| `warmup` | boolean | Enable warmup run after model loading. Default: `true`. | `warmup:false` |

View File

@@ -1,3 +1,3 @@
{
"version": "v4.5.5"
"version": "v4.5.6"
}

View File

@@ -1,4 +1,113 @@
---
- name: "qwopus3.6-35b-a3b-coder-mtp"
url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
urls:
- https://huggingface.co/Jackrong/Qwopus3.6-35B-A3B-Coder-MTP-GGUF
description: |
# 🌟 Qwopus3.6-35B-A3B-v1
## 💡 Base Model Overview
**Qwen3.6-35B-A3B** is an advanced hybrid sparse MoE (Mixture-of-Experts) model developed by Alibaba Cloud. It features 35B total parameters with only 3B active parameters per token, ensuring high inference efficiency. Architecturally, it combines Gated DeltaNet linear attention with standard gated attention layers, routing tokens across **256 experts**. It natively supports a massive **262k context window** and is specifically designed for high-performance agentic coding, deep reasoning, and multimodal tasks.
## 🚀 Model Refinement & Logic Tuning Qwopus3.6-35B-A3B-v1
🪐**Qwopus3.6-35B-A3B-v1** is a reasoning-enhanced MoE (Mixture of Experts) model fine-tuned on top of **Qwen3.6-35B-A3B**.
### 🛠 Training Strategy
The fine-tuning process for this model is structured into **three distinct stages of distributed SFT (Supervised Fine-Tuning)**, progressively scaling reasoning complexity and data diversity. This systematic approach ensures the model inherits the base MoE capabilities while sharpening its logic-handling depth.
...
license: "apache-2.0"
tags:
- llm
- gguf
- vision
- multimodal
icon: https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/ztbyGV_zGhzcLuTCSVyq3.png
overrides:
backend: llama-cpp
function:
automatic_tool_parsing_fallback: true
grammar:
disable: true
known_usecases:
- chat
mmproj: llama-cpp/mmproj/Qwopus3.6-35B-A3B-Coder-MTP-Q4_K_M/mmproj-F32.gguf
options:
- use_jinja:true
- spec_type:draft-mtp
- spec_n_max:6
- spec_p_min:0.75
parameters:
model: llama-cpp/models/Qwopus3.6-35B-A3B-Coder-MTP-Q4_K_M/Qwopus3.6-35B-A3B-Coder-MTP-Q4_K_M.gguf
template:
use_tokenizer_template: true
files:
- filename: llama-cpp/models/Qwopus3.6-35B-A3B-Coder-MTP-Q4_K_M/Qwopus3.6-35B-A3B-Coder-MTP-Q4_K_M.gguf
sha256: c283cd2321a3cb4c6e7faf9481ac7d946913e4f02e20172eb2872112f567d8d4
uri: https://huggingface.co/Jackrong/Qwopus3.6-35B-A3B-Coder-MTP-GGUF/resolve/main/Qwopus3.6-35B-A3B-Coder-MTP-Q4_K_M.gguf
- filename: llama-cpp/mmproj/Qwopus3.6-35B-A3B-Coder-MTP-Q4_K_M/mmproj-F32.gguf
sha256: 5c82c8095717b39f29c88ebfec3607a10307785b1e14a87744603d6c582cd497
uri: https://huggingface.co/Jackrong/Qwopus3.6-35B-A3B-Coder-MTP-GGUF/resolve/main/mmproj-F32.gguf
- name: "ornith-1.0-9b-mtp"
url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
urls:
- https://huggingface.co/protoLabsAI/Ornith-1.0-9B-MTP-GGUF
description: |
[](https://deep-reinforce.com/ornith.html)
# Ornith-1.0-9B
Aloha! 🌺 Today, we are releasing Ornith-1.0, a self-improving family of open-source models for agentic coding.
Highlights:
- **State-of-the-Art Coding Agents**: Available in 9B-Dense, 31B-Dense, 35B-MoE, and 397B-MoE (post-trained on top of Gemma 4 and Qwen 3.5), achieving state-of-the-art performance among open-source models of comparable size on coding benchmarks such as Terminal-Bench 2.1, SWE-Bench, NL2Repo and OpenClaw.
- **Self-Improving Training Framework**:  Ornith-1.0 employs RL to learn to generate not only solution rollouts, but also the scallfold that drive those rollouts. By jointly optimizing the scaffold and the resulting solution, the model discovers better search trajectories and generates higher-quality solutions.
- **Licence**: MIT licensed, globally accessible, and free from regional limitations.
## Ornith 1.0 9B
This model card documents **Ornith-1.0-9B**, the most lightweight member of the Ornith family, designed for efficient single-GPU deployment.
### Benchmarks
Ornith-1.0-9B
Qwen3.5-9B
Qwen3.5-35B
Gemma4-12B
Gemma4-31B
Agentic Coding
...
license: "mit"
tags:
- llm
- gguf
overrides:
backend: llama-cpp
function:
automatic_tool_parsing_fallback: true
grammar:
disable: true
known_usecases:
- chat
options:
- use_jinja:true
- spec_type:draft-mtp
- spec_n_max:6
- spec_p_min:0.75
parameters:
model: llama-cpp/models/ornith-9b-mtp-kl-Q4_K_M/ornith-9b-mtp-kl-Q4_K_M.gguf
template:
use_tokenizer_template: true
files:
- filename: llama-cpp/models/ornith-9b-mtp-kl-Q4_K_M/ornith-9b-mtp-kl-Q4_K_M.gguf
sha256: 03de14e9849a02258b9ac9ec2830af39dd7ac48f1ca5492b35539d1838707bc8
uri: https://huggingface.co/protoLabsAI/Ornith-1.0-9B-MTP-GGUF/resolve/main/ornith-9b-mtp-kl-Q4_K_M.gguf
- name: "qwen-agentworld-35b-a3b"
url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
urls:
@@ -1758,8 +1867,8 @@
use_tokenizer_template: true
files:
- filename: llama-cpp/models/Qwopus3.5-9B-Coder-MTP-GGUF/Qwopus3.5-9B-Coder-MTP-Q4_K_M.gguf
sha256: f6fc5d193045796d9e1870cbc40f827fe55f53f70593c3f5c1968b82b9331991
uri: https://huggingface.co/Jackrong/Qwopus3.5-9B-Coder-MTP-GGUF/resolve/main/Qwopus3.5-9B-Coder-MTP-Q4_K_M.gguf
sha256: 9ea3ecd122a5165b8b81655f29eaf09d71daf841503e4c4212bdfadb36ab3712
- filename: llama-cpp/mmproj/Qwopus3.5-9B-Coder-MTP-GGUF/Qwopus3.5-9B-Coder-MTP-mmproj.gguf
sha256: f48daca405a1c768a9514e392c3955dcc4a9d66a5cf64cf45e064092b5f20ee4
uri: https://huggingface.co/Jackrong/Qwopus3.5-9B-Coder-MTP-GGUF/resolve/main/Qwopus3.5-9B-Coder-MTP-mmproj.gguf

View File

@@ -461,10 +461,7 @@ func (p *RuleParser) parse(arena *Arena, ctx *ParseContext, start int) ParseResu
if result.Type != Fail {
text := ""
if result.Start < len(ctx.Input) {
end := result.End
if end > len(ctx.Input) {
end = len(ctx.Input)
}
end := min(result.End, len(ctx.Input))
text = ctx.Input[result.Start:end]
}
@@ -514,10 +511,7 @@ func (p *TagParser) parse(arena *Arena, ctx *ParseContext, start int) ParseResul
if result.Type != Fail {
text := ""
if result.Start < len(ctx.Input) {
end := result.End
if end > len(ctx.Input) {
end = len(ctx.Input)
}
end := min(result.End, len(ctx.Input))
text = ctx.Input[result.Start:end]
}

105
pkg/grpc/parentwatch.go Normal file
View File

@@ -0,0 +1,105 @@
package grpc
import (
"log"
"os"
"runtime"
"strings"
"time"
)
// Backend worker processes (the per-model gRPC servers LocalAI spawns) are
// deliberately placed in their own process group by the process manager so
// LocalAI's graceful shutdown can signal the whole group. That graceful path
// (SIGTERM -> grace -> SIGKILL, driven by pkg/signals + pkg/model) only runs
// when LocalAI itself receives a catchable signal and lives long enough to run
// its handlers. If LocalAI is SIGKILLed (e.g. a supervising process's
// graceful-shutdown grace period elapses first), that teardown never runs and
// this backend would be reparented to init and linger, holding VRAM and its
// listen port.
//
// The watcher below is a best-effort backstop for exactly that case: it does
// NOT replace the graceful teardown, it only covers the "parent vanished
// without cleaning up" path. It works by detecting reparenting: when the
// process that spawned this backend dies, the kernel reparents us to the
// nearest sub-reaper or to init (PID 1), so getppid() stops matching the value
// we captured at startup. This getppid() approach is portable across
// Linux/macOS (unlike Linux-only PR_SET_PDEATHSIG), which is why it's used
// here rather than a kernel parent-death signal.
const (
// EnvBackendParentWatch toggles the parent-death watcher. It is enabled by
// default; set it to a falsey value ("false", "0", "no", "off") to disable
// (e.g. when running a backend standalone for debugging under a shell whose
// lifetime shouldn't govern the backend).
EnvBackendParentWatch = "LOCALAI_BACKEND_PARENT_WATCH"
// EnvBackendParentWatchInterval overrides the poll interval as a Go
// duration string (e.g. "500ms"). Defaults to defaultParentWatchInterval.
EnvBackendParentWatchInterval = "LOCALAI_BACKEND_PARENT_WATCH_INTERVAL"
defaultParentWatchInterval = 2 * time.Second
)
// parentWatchEnabled reports whether the watcher should run in this process.
func parentWatchEnabled() bool {
switch strings.ToLower(strings.TrimSpace(os.Getenv(EnvBackendParentWatch))) {
case "false", "0", "no", "off":
return false
}
// Windows does not reparent orphans to a well-known init PID, so the
// getppid() heuristic used here doesn't apply there.
return runtime.GOOS != "windows"
}
// parentWatchInterval returns the configured poll interval, or the default.
func parentWatchInterval() time.Duration {
if v := os.Getenv(EnvBackendParentWatchInterval); v != "" {
if d, err := time.ParseDuration(v); err == nil && d > 0 {
return d
}
}
return defaultParentWatchInterval
}
// parentDied reports whether this process has been reparented away from the
// parent it had when the watcher started. Reparenting is the standard POSIX
// signal that the original parent (here, the LocalAI process that spawned this
// backend) has exited: the orphan is handed to the nearest sub-reaper or to
// init (PID 1), so getppid() no longer matches the value captured at startup.
func parentDied(origPPID int) bool {
ppid := os.Getppid()
return ppid != origPPID || ppid == 1
}
// watchParentDeath polls until parentDied reports the original parent is gone,
// then invokes onDeath. It blocks, so run it in its own goroutine.
func watchParentDeath(origPPID int, interval time.Duration, onDeath func()) {
ticker := time.NewTicker(interval)
defer ticker.Stop()
for range ticker.C {
if parentDied(origPPID) {
onDeath()
return
}
}
}
// startParentDeathWatcher installs the best-effort safety net described above
// on the calling backend process. It is a no-op when disabled or on platforms
// where the mechanism doesn't apply. This is a backstop alongside — never a
// replacement for — LocalAI's graceful SIGTERM->grace->SIGKILL teardown.
func startParentDeathWatcher() {
if !parentWatchEnabled() {
return
}
origPPID := os.Getppid()
// A parent of 1 at startup means we were already orphaned (or launched
// directly under init) — there's no original parent to watch for.
if origPPID <= 1 {
return
}
interval := parentWatchInterval()
go watchParentDeath(origPPID, interval, func() {
log.Printf("backend parent process (pid %d) exited without stopping this backend; self-terminating to avoid orphaning", origPPID)
os.Exit(1)
})
}

Some files were not shown because too many files have changed in this diff Show More