Compare commits

..

101 Commits

Author SHA1 Message Date
LocalAI [bot]
a0f3e26245 fix(distributed): make admin backend installs resilient and observable (#9958)
* feat(distributed): add configurable NATS backend install/upgrade timeouts

Adds BackendInstallTimeout and BackendUpgradeTimeout to DistributedConfig
with 15m defaults, following the existing MCPToolTimeout / WorkerWaitTimeout
pattern. These will replace the hardcoded literals in RemoteUnloaderAdapter
so admin-driven backend installs across the cluster survive long OCI image
pulls that previously timed out at 3m.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* style(distributed): gofmt alignment after timeout fields

Re-aligns the Validate() negative-duration map and the Default* const
block so the new BackendInstall/UpgradeTimeout entries do not leave
the surrounding columns mis-padded.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(cli): surface LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT and _UPGRADE_TIMEOUT

Parses the two new env vars on the run CLI and threads them through the
existing AppOption builder so DistributedConfig picks them up. Invalid
duration strings now fail loudly at startup rather than silently falling
back to the default.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): inject NATS install/upgrade timeouts into RemoteUnloaderAdapter

Removes the hardcoded 3m / 15m literals from RemoteUnloaderAdapter and
threads in DistributedConfig.BackendInstallTimeoutOrDefault() and
BackendUpgradeTimeoutOrDefault() at construction. Install now defaults
to 15m (was 3m); cold OCI image pulls on Jetson Wi-Fi routinely blew
past the old ceiling. Scripted messaging client captures the timeout
so tests can assert the configured value actually reaches the NATS
request.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): introduce galleryop.ErrWorkerStillInstalling sentinel

When the NATS request-reply for backend.install (or .upgrade) times out
the worker is almost always still pulling the OCI image. Wrap the timeout
in a typed sentinel so the manager above can distinguish "worker hung"
from "worker still working" and leave the pending_backend_ops row in
place for the reconciler to confirm via backend.list.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): treat NATS install timeout as in-progress, not failure

When a worker times out replying to backend.install but the install is
still running on the worker, enqueueAndDrainBackendOp now reports a
running_on_worker status and pushes NextRetryAt out by the install
timeout so the reconciler does not immediately re-fire another install
while the worker is still pulling the image. The pending_backend_ops
row stays in place for the next reconciler pass to confirm via
backend.list.

InstallBackend wraps the result in galleryop.ErrWorkerStillInstalling
so callers can branch (galleryop renders yellow in-progress instead of
red error). UpgradeBackend uses the same wrap.

Adds RemoteUnloaderAdapter.InstallTimeout() so the manager can push
NextRetryAt by the configured timeout without reaching into a private
field, and NodeRegistry.RecordPendingBackendOpInFlight as the soft
cousin of RecordPendingBackendOpFailure.

Also includes incidental gofmt-driven struct-field alignment in
registry.go on lines unrelated to the change (touched files are
re-formatted to canonical form per project policy).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(distributed): don't increment Attempts on in-flight install timeout

An in-flight timeout (worker still pulling the OCI image) is not a
failed attempt, it's a delayed one. Incrementing Attempts let
genuinely-progressing slow installs (e.g. 30 GB CUDA images on Wi-Fi)
trip the reconciler's maxPendingBackendOpAttempts cap and dead-letter
the queue row while the worker was still legitimately working.

RecordPendingBackendOpInFlight now only updates LastError and NextRetryAt.
Also documents "running_on_worker" in the NodeOpStatus.Status enum
comment so Task 6 implementers see the full surface.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(galleryop): surface ErrWorkerStillInstalling as non-error OpStatus

When the distributed backend manager returns an error that wraps
ErrWorkerStillInstalling, backendHandler now completes the op with a
"still installing in background" message rather than marking it as a
red failure. Admin UI sees a yellow in-progress state; reconciler
confirms completion on its next pass.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(distributed): end-to-end install-timeout-then-reconcile

Wires Task 1-6 end-to-end so any seam mismatch surfaces in CI rather
than during a real cluster install. NATS times out, the queue row
stays alive with running_on_worker status, the worker eventually
reports the backend installed via backend.list, the manager surfaces
it via ListBackends.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): document LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT / _UPGRADE_TIMEOUT

Add the two new operator-tunable env vars to the Frontend Configuration
table in the distributed-mode docs. Explains the 15m default, when to
raise it (slow links pulling multi-GB OCI images), and the new
"still installing in background" admin-UI state when the round-trip
times out but the worker is still working.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): clear pending install rows when backend.list confirms

DistributedBackendManager.ListBackends now proactively clears
pending_backend_ops install rows whose (nodeID, backend) is reported
installed by backend.list. Operator UI updates immediately instead of
waiting up to installTimeout (default 15m) for the next reconciler
tick after NextRetryAt.

Only install rows are cleared; upgrade and delete intents are not
satisfied by presence in backend.list and continue to drain through
their normal reconciler paths.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(messaging): add BackendInstallProgressEvent wire type and subject

New NATS subject nodes.<nodeID>.backend.install.<opID>.progress lets the
worker publish transient progress events (file, current/total bytes,
percentage, phase) while a long-running install pulls its OCI image.
BackendInstallRequest gains an optional OpID field so the worker knows
which subject to publish on.

Transient pub/sub (not JetStream): the install reply remains ground
truth for success/failure; dropped progress events are tolerable.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* style(messaging): drop em-dash from BackendInstallProgress test comment

Per project convention (no em-dashes anywhere). Comment substance is
unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): worker publishes debounced install progress over NATS

When BackendInstallRequest.OpID is set, the worker's backend.install
handler wires a debounced publisher (250ms window) into the gallery
download callback. Each tick becomes a BackendInstallProgressEvent on
nodes.<nodeID>.backend.install.<opID>.progress; the publisher always
emits a final event on Flush so the UI sees the terminal percentage.

Old masters that do not set OpID continue to run silent installs: no
behavior change for them. Lock ordering: the publisher releases its
mutex before calling messaging.Publish so a slow network never stalls
the install loop.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): RemoteUnloaderAdapter subscribes to install progress

InstallBackend gains opID + onProgress parameters. When both are set,
the adapter subscribes to nodes.<nodeID>.backend.install.<opID>.progress
BEFORE publishing the install request, decodes each message into the
caller's onProgress callback in a goroutine (so a slow callback never
stalls the NATS reader thread), and unsubscribes after RequestJSON
returns.

When onProgress is nil OR opID is empty (the reconciler retry path),
subscription is skipped entirely - silent installs cost nothing extra.

Subscribe failure is logged at Warn and the install proceeds without
progress streaming; the NATS round-trip still owns terminal status.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): forward backend install progress into galleryop OpStatus

DistributedBackendManager.InstallBackend now passes the gallery op ID
and a progress bridge into the adapter call. Each
BackendInstallProgressEvent from the worker becomes a
galleryop.ProgressCallback tick - which the existing backendHandler
already turns into OpStatus.UpdateStatus, so the admin UI/SSE polling
sees per-byte progress for distributed installs without any UI-side
change.

UpgradeBackend is intentionally left silent for now: its wire request
(BackendUpgradeRequest) does not carry OpID, and rolling-update
fallback is the rarer path. Will be picked up in a follow-up if the
worker upgrade path also gets a progress channel.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(distributed): InstallBackend tolerates silent (pre-Phase-2) workers

A worker on pre-Phase-2 code never publishes progress events. The new
master subscribes optimistically; this spec pins that a silent worker
still produces a green install with no progressCb ticks. The install
reply is the source of truth for terminal state; the progress stream
is a best-effort UX enrichment.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): document install progress streaming

Note the new nodes.<nodeID>.backend.install.<opID>.progress subject and
the silent-worker compatibility behavior so operators know to expect
real-time progress and what happens on a mixed-version cluster.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): note progress-event ordering trade-off in InstallBackend

Document near the goroutine dispatch why ordering at the consumer is
best-effort, why it rarely matters in practice (worker debounce >>
goroutine jitter), and what a future hardening pass would look like
(Seq field + stale-by-seq drop). Stops the next reader from accidentally
"fixing" the goroutine pool away.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(galleryop): add NodeProgress + OpStatus.Nodes for per-node breakdown

Adds the data model the UI needs to render an expandable per-node
breakdown of a fanned-out backend install. NodeProgress carries node
identity (ID + name), per-node status (queued / running_on_worker /
success / error / downloading), the current file + bytes + percentage
from the Phase 2 progress stream, and any per-node error.

OpStatus.Nodes is the slice the /api/operations handler will surface
in a follow-up.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(galleryop): UpdateNodeProgress merges per-node ticks by NodeID

GalleryService.UpdateNodeProgress(opID, nodeID, np) merges a NodeProgress
into OpStatus.Nodes (keyed by NodeID, no duplicates) and mirrors the
latest tick into the aggregate Progress / FileName /
DownloadedFileSize / TotalFileSize fields so the legacy single-bar
OperationsBar view keeps working unchanged alongside the new per-node
breakdown.

Concurrent-safe via the existing g.Mutex.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): write per-node OpStatus entries during install fan-out

DistributedBackendManager now accepts a nodeProgressSink and feeds it
two streams:

1. enqueueAndDrainBackendOp emits a per-node terminal entry on each
   status it appends to BackendOpResult (queued, success, error,
   running_on_worker). The opID is threaded through the function so
   the sink gets the right gallery op identity.

2. The install apply closure fans each BackendInstallProgressEvent
   into the sink as a downloading entry, alongside the legacy
   progressCb path so the aggregate single-bar view stays correct.

Production wiring passes the GalleryService (which implements
UpdateNodeProgress via Task 2) as the sink. Single-node tests pass
nil. DeleteBackend and UpgradeBackend pass an empty opID so the
sink path no-ops for ops that aren't gallery-tracked the same way
as Install.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(operations): expose per-node breakdown on /api/operations

When an operation's OpStatus has Nodes entries (populated by the
Phase 4 progress sink wiring), surface them as a "nodes" array on the
/api/operations response, sorted by node_name for stable rendering.

Backward compatible: legacy clients ignore the field; ops without any
node entries (single-node mode, model installs) omit the array entirely
thanks to the empty-slice guard.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): per-node breakdown in OperationsBar

When an install op fans out to more than one worker, the operations
bar now shows a "N nodes" chevron that expands into a per-node list.
Each row carries the node's status (color-coded pill), the current
file being downloaded, byte counts, percentage, and a thin per-node
progress bar. Yellow "Worker busy" pill marks running_on_worker
status with a tooltip explaining the NATS round-trip timed out but
the worker is still installing in the background.

Backward compatible: ops without a nodes field (legacy or single-node
mode) render as before. State for expand/collapse is local to the
component, keyed by jobID/id - reload starts collapsed.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): document per-node breakdown in the operations bar

Adds a short subsection covering the expandable "N nodes" chevron in
the OperationsBar admin UI, the meaning of each status pill, and
how it relates to the /api/operations nodes array.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(galleryop): UpdateStatus preserves Nodes when caller sends none

Real-world bug surfaced by the Phase 4 multi-worker smoke test: the
nodes[] array in /api/operations flickered between a single node at a
time on a 2-worker install. Root cause: the Phase 2 progress bridge
also calls the legacy progressCb -> UpdateStatus(&OpStatus{...}) on
every tick. UpdateStatus then overwrote the entire status pointer,
wiping the Nodes slice that UpdateNodeProgress had just merged in.

Fix: in UpdateStatus, if the incoming op has an empty Nodes slice,
carry forward the previous status's Nodes before storing. Callers
that explicitly populate Nodes still win (their slice replaces the
prior one, no merge across the two code paths).

Two regression specs added pinning both directions of the contract.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): strip implementation details from user-facing docs

Trim the new install/upgrade timeout rows and the install-progress
sections to focus on what the operator sees and tunes. Drops:

- the NATS subject names and pub/sub mechanics
- "round-trip" / reconciler / backend.list jargon
- /api/operations polling cadence
- "pre-2026-05-22" version references

Reframes the breakdown text around the admin UI (Operations Bar,
chevron, status pills, "Worker busy" tooltip). Implementation context
lives in the agent notes and code comments.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(config): move DistributedConfig.Validate flag names to constants

The negative-duration check map was a wall of literal kebab-case
strings that had to stay in sync with the kong-derived CLI flag names
manually. Move them to a Flag* const block alongside the existing
Default* block so a rename of either the Go field or the CLI naming
convention forces a compile error rather than silent drift.

Sole consumer today is Validate; the constants are exported so future
operator-facing surfaces (e.g. error messages on other validation
paths) can reference them by name instead of repeating the literals.

Tests pin both the literal values (so a future "let's just rename
this" doesn't accidentally regress the CLI flag) and the negative-
duration error message for the new BackendInstall / BackendUpgrade
fields.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(distributed): extract NodeStatus and Phase enums to constants

Sweep for the same literal-string-as-identifier pattern called out on
the Validate flag names: the per-node install status enum
("queued" | "downloading" | "running_on_worker" | "success" | "error")
appeared as raw literals across managers_distributed.go (10+ sites,
including 3 separate `n.Status == "running_on_worker"` checks),
operation.go, and the test suite. Same shape for the Phase enum
("resolving" | "downloading" | "extracting" | "starting") in the
worker-side progress publisher.

Promote both to exported const blocks:

- galleryop.NodeStatus{Queued,Downloading,RunningOnWorker,Success,Error}
  shared between galleryop.NodeProgress.Status (the wire field) and
  nodes.NodeOpStatus.Status (the in-process per-node summary)
- messaging.Phase{Resolving,Downloading,Extracting,Starting}
  shared between the worker publisher and any future consumer that
  needs to switch on phase

Tests pin both the literal values (so a future "let's just rename" doesn't
silently change the JSON wire) and use the constants in setup (so the
producer side stays drift-protected). Wire-format assertions on the
/api/operations JSON output keep their literals deliberately, so the
constant value can never silently diverge from what the UI receives.

Out of scope for this PR (separate cleanup): the finetune and
quantization job-status enums have the same anti-pattern with 14+
literal sites each, but predate this PR's work.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-23 12:35:44 +02:00
LocalAI [bot]
e4cc1f11f3 chore: ⬆️ Update ggml-org/llama.cpp to 1acee6bf8939948f9bcbf4b14034e4b475f06069 (#9952)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-23 08:38:29 +02:00
LocalAI [bot]
6ed269d0b9 chore: ⬆️ Update ggml-org/whisper.cpp to 0ccd896f5b882628e1c077f9769735ef4ce52860 (#9954)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-23 08:37:26 +02:00
LocalAI [bot]
5756fb046d chore: ⬆️ Update leejet/stable-diffusion.cpp to 0baf721215f45335a5df8caf0ecb34e870c956e7 (#9955)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-23 08:37:10 +02:00
Copilot
7980629bc5 Fix backend manifest merge signing on current cosign releases (#9957)
* Initial plan

* fix: remove deprecated cosign bundle flag from backend merge workflow

Agent-Logs-Url: https://github.com/mudler/LocalAI/sessions/4207dabc-14ec-4655-9594-487338977fcf

Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-23 00:20:28 +02:00
LocalAI [bot]
d0a59be9de chore: ⬆️ Update ikawrakow/ik_llama.cpp to b3d39cff8bffbd67296d6badd4076a1486a0715c (#9953)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-22 23:58:48 +02:00
LocalAI [bot]
5cda4f1ccf fix(L4T13 backends): switch vllm/sglang/vllm-omni to PyPI aarch64+cu130 wheels (#9950)
* fix(vllm): switch L4T13 backend to PyPI aarch64+cu130 wheels

The L4T13 vllm backend pulled torch / torchvision / torchaudio / vllm from
pypi.jetson-ai-lab.io's sbsa/cu130 mirror via [tool.uv.sources] with no
version pins. That mirror started shipping torch 2.11.0 next to a
vllm-0.20.0+cu130 wheel that was still compiled against torch 2.10's c10
ABI, so uv landed on the mismatched pair and vllm crashed at import:

  ImportError: vllm/_C.abi3.so: undefined symbol:
  _ZN3c1013MessageLoggerC1EPKciib

(c10::MessageLogger's constructor signature changed between torch 2.10 and
2.11; the vllm wheel referenced the 2.10 form, the installed libc10.so
exported only the 2.11 form.)

Since torch 2.11 (April 2026) PyPI publishes its own aarch64 + cu130
manylinux wheels, and vllm 0.20.0 ships an aarch64 wheel whose Requires-
Dist locks torch==2.11.0 / torchvision==0.26.0 / torchaudio==2.11.0. That
makes uv's resolver produce an ABI-consistent set automatically, so the
mirror and the [tool.uv.sources] pinning are no longer needed.

flash-attn is dropped from the dep list: PyPI has no aarch64 wheel, but
vLLM 0.20+ already bundles its own vllm_flash_attn (fa2 + fa3) inside the
main wheel, so the Dao-AILab package isn't required at runtime.

Reference: https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] [WebFetch]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(vllm): retire l4t13 pyproject.toml in favor of requirements-*.txt

pyproject.toml only existed because uv pip install -r requirements.txt
doesn't honor [tool.uv.sources]. The previous commit dropped [tool.uv.
sources] (PyPI now serves the aarch64 + cu130 wheels directly), so the
file no longer carries any logic the requirements-*.txt path can't.

Replace with the same two-file pattern every other build profile uses:

  - requirements-l4t13.txt       (accelerate / torch / transformers /
                                  bitsandbytes - matches cublas13's split)
  - requirements-l4t13-after.txt (vllm; runs after the base resolve so
                                  the cu130 torch wheel lands first)

install.sh's whole l4t13 elif branch goes away; libbackend.sh's
installRequirements already handles the requirements-install.txt build-
deps pass, the C_INCLUDE_PATH export for PORTABLE_PYTHON, and the
runProtogen call, so falling through to the standard else: branch
produces identical install behavior with less surface area.

No functional change at install time - same wheels, same order.

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(sglang,vllm-omni): switch L4T13 backends to PyPI aarch64+cu130 wheels

Same root cause and same fix as the vllm backend in the previous commits:
the L4T13 sglang and vllm-omni backends both pulled their accelerator
stack from pypi.jetson-ai-lab.io's sbsa/cu130 mirror with no version
pins, so they would silently land on the same torch 2.11 vs cu130-built
wheel ABI mismatch the moment the mirror published an out-of-sync pair.

sglang
------

- Drop pyproject.toml + [tool.uv.sources]. The historical comment said
  the [all] extra was unsafe on aarch64 because of decord, but sglang
  0.5.x now uses `decord2` on aarch64/arm/armv7l (which ships cp312
  aarch64 wheels), so we can match cublas13's sglang[all]>=0.5.11 pin
  and stop being capped at the 0.5.1.post2 the L4T mirror shipped.
  That unblocks Gemma 4 / MTP recipes on Jetson Thor.
- New requirements-l4t13.txt mirrors the cublas13 split (accelerate /
  torch / torchvision / torchaudio / transformers), requirements-l4t13-
  after.txt carries sglang[all]>=0.5.11.
- install.sh's l4t13 elif branch goes away; falls through to the
  standard installRequirements path.

vllm-omni
---------

- requirements-l4t13.txt drops --extra-index-url to jetson-ai-lab and
  drops flash-attn (PyPI has no aarch64 wheel, vLLM 0.20+ bundles its
  own vllm_flash_attn fa2 + fa3 internally).
- install.sh's l4t13 vllm-install branch collapses into the cublas13
  branch since both now just run `pip install vllm --torch-backend=auto`
  against PyPI.
- --index-strategy=unsafe-best-match is dropped from the top-level
  l4t13 guard; without the L4T mirror in the picture it had no purpose.

The from-source vllm-omni install on top still keeps its existing
`sed -i '/^fa3-fwd[[:space:]]*==/d' requirements/cuda.txt` workaround -
fa3-fwd has no aarch64 wheel and no sdist, unrelated to flash-attn.

Reference: https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] [WebFetch]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(sglang): drop [all] extra on l4t13 - xatlas has no aarch64 wheel

CI revealed that sglang[all]==0.5.12 transitively pulls xatlas via the
[diffusion] sub-extra, and xatlas ships no aarch64 wheel. Its sdist
depends on scikit_build_core without declaring it in build-system.
requires, so under --no-build-isolation uv can't build it from source:

    × Failed to build `xatlas==0.0.11`
    ├─▶ The build backend returned an error
    ╰─▶ Call to `scikit_build_core.build.build_wheel` failed (exit status: 1)
        ModuleNotFoundError: No module named 'scikit_build_core'
    help: `xatlas` (v0.0.11) was included because `sglang[all]` (v0.5.12)
          depends on `xatlas`

Upstream sglang explicitly gates st_attn and vsa on
`platform_machine != aarch64` inside the same [diffusion] extra but
forgot xatlas - same class of bug that bit the old decord pin.

Use plain `sglang>=0.5.11` on l4t13. backend.py imports only base
sglang.srt symbols (Engine, ServerArgs, FunctionCallParser,
ReasoningParser); the [all] extras are optional accelerators not
required at import time. cublas13 (x86_64) keeps [all] because xatlas
has x86_64 wheels there.

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-22 23:01:22 +02:00
LocalAI [bot]
c500461c69 feat(config): default prompt_cache_all to true (#9951)
Upstream llama.cpp defaults `cache_prompt = true` (common/common.h),
but `parse_options` in the grpc-server backend unconditionally forwards
the proto `PromptCacheAll` field, so any model that didn't set
`prompt_cache_all: true` in its YAML was getting `cache_prompt=false` —
silently overriding llama.cpp's own default. With `kv_unified` and
`cache_idle_slots` already on by default, this was the last piece
preventing the per-request prompt cache from being usable out of the
box.

Make `PromptCacheAll` tristate (`*bool`), default it to `true` in
`SetDefaults`, and dereference at the proto boundary. Users can still
opt out with an explicit `prompt_cache_all: false`. Same pattern as
`MMap`, `MMlock`, `Reranking`, etc.

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 22:06:22 +02:00
LocalAI [bot]
834ecc36bf fix(react-ui): unify backend-logs entry point for distributed mode (#9949)
In distributed mode the local /api/backend-logs WebSocket has nothing
behind it (inference runs on workers), so the "View backend logs" link
in Traces (and the action in Manage when previously not hidden) dead-
ended on /app/backend-logs/<modelId>. Manage worked around it by
hiding the action; Traces still rendered the link.

Make /app/backend-logs/:modelId the single, mode-aware entry point.
A new BackendLogsRouter probes useDistributedMode and forks:

  - standalone: existing local WebSocket view (BackendLogsDetail).
  - distributed: DistributedBackendLogsResolver fans out to each node
    via nodesApi.getModels, filters by model_name, and routes:
      * 0 hits   -> empty state with a link to the Nodes page.
      * 1 hit    -> <Navigate replace> to
                    /app/node-backend-logs/<nodeId>/<modelId>,
                    preserving the ?from= deep-link timestamp.
      * N hits   -> picker listing each hosting worker (node id,
                    replica index, load state) so the operator can
                    choose which worker's logs to view.

Bare modelId in the redirect target intentionally aggregates that
node's replicas via the worker's BackendLogStore, matching the
existing per-node link pattern in Nodes.jsx.

Revert the per-caller distributed checks now that routing is
centralised: drop the hidden:distributedMode guard on Manage's
Backend logs action, and remove the prop threading in Traces so the
link is unconditional. Any future view that wants to link to backend
logs uses the same URL and gets correct behaviour in both modes.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-22 22:00:08 +02:00
LocalAI [bot]
61bf34ea2f fix(traces): cap captured body size to keep admin Traces UI responsive (#9946)
The trace middleware buffered the full request and response bodies for every
JSON exchange. With a chatty agent-pool RAG workload, /embeddings responses
(large vector arrays) accumulated to tens of MB in the in-memory buffer; the
admin Traces page would then download and parse 40+ MB on every load and on
every 5s auto-refresh, locking the UI in a loading state.

Add LOCALAI_TRACING_MAX_BODY_BYTES (default 64 KiB) that caps each captured
body. The full payload still flows through to the real client; only the
trace copy is bounded. Exchanges record body_truncated and original
body_bytes so the dashboard can show that truncation happened. The cap is
configurable via env, CLI, and runtime_settings.json.

Also unblock recovery: the Traces page now keeps the Clear button enabled
while loading, since "buffer too large to render" is exactly when the user
needs to clear it.


Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-22 15:29:24 +02:00
LocalAI [bot]
0b2ae3c6ca fix(openai): stream usage non-zero when tools are enabled (#9941)
* chore: ignore local .worktrees directory

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(openai): stream usage non-zero when tools are enabled

The streaming chat-completions worker for tool-bearing requests
(processTools in core/http/endpoints/openai/chat.go) never forwarded the
cumulative TokenUsage from ComputeChoices to the chunks it placed on the
responses channel. The outer streaming loop's running usage tracker
therefore stayed at the zero value, and the include_usage trailer
reported {prompt_tokens:0, completion_tokens:0, total_tokens:0} whenever
the request carried a `tools` array. Without tools, the alternative
`process` path stamps Usage on every chunk, so that path was unaffected.

Forward the final TokenUsage via a usage-only sentinel chunk (empty
Choices, populated Usage) emitted right before close(responses). The
outer loop's per-chunk Usage capture moves above the empty-Choices skip
so the sentinel updates the tracker without ever reaching the wire,
keeping the existing OpenAI spec contract (intermediate chunks carry no
`usage` field, and the deferred-final-chunk helpers remain Usage-free
per the regression test for issue #8546).

Adds streamUsageFromTokenUsage, usageSentinelChunk, and
applyChunkToUsage helpers with focused Ginkgo coverage plus a flow-level
test that mirrors the outer-loop sequence.

Fixes #9927

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4-7 [Claude Code]

* refactor(openai): return final TokenUsage from stream workers

Replace the usage-only sentinel SSE chunk introduced in the previous
commit with a plain return value. The streaming workers process and
processTools (now extracted as package-level processStream and
processStreamWithTools) return (backend.TokenUsage, error); the outer
ChatEndpoint loop reads the cumulative counts off the existing `ended`
channel (now carrying streamWorkerResult{usage, err}) and builds the
include_usage trailer from a normal Go value after the LOOP exits.

This drops the empty-Choices "skip but capture Usage" rule from the
outer loop and removes the usageSentinelChunk / applyChunkToUsage
helpers entirely. The SSE responses channel is back to a single
purpose: wire chunks only.

processStream and processStreamWithTools move into chat_stream_workers.go
so they can be exercised directly from tests. The chat_stream_usage_test.go
suite now drives the workers with a mocked backend.ModelInferenceFunc
and asserts on the returned TokenUsage. The regression coverage for
issue #9927 is therefore behavioral: reverting the fix (discarding
ComputeChoices' usage return) makes the assertions fail with concrete
count mismatches.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4-7 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-22 10:13:41 +02:00
LocalAI [bot]
4735345105 chore: ⬆️ Update ggml-org/llama.cpp to bb28c1fe246b72276ee1d00ce89306be7b865766 (#9934)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-22 09:49:33 +02:00
LocalAI [bot]
7384fd800b chore: ⬆️ Update antirez/ds4 to 8d576642c39b9a2d782a80159ba84ef5a81c0b81 (#9932)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-22 08:31:49 +02:00
LocalAI [bot]
6942713d85 chore: ⬆️ Update leejet/stable-diffusion.cpp to 3a8788cb7d74f185d6b18688e9563015524ecaf5 (#9933)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-22 00:31:19 +02:00
LocalAI [bot]
0cf52c44d4 chore: ⬆️ Update ggml-org/whisper.cpp to 8443cf05e3fa8ce1b32348e1bcbcf8fc31f7f3ae (#9929)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-21 23:24:01 +02:00
LocalAI [bot]
0d34cf7cbd chore: ⬆️ Update ikawrakow/ik_llama.cpp to 48a55f74e4c6e2aeda363dd386c1ac9170a0af71 (#9930)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-21 23:23:37 +02:00
LocalAI [bot]
f0cb02afb8 feat(usage): attribute Sources rows to user accounts in admin view (#9935)
The merged feature (#9920) let admins see per-API-key and per-source
totals but did not surface which user owned each key, and lumped
every user's Web UI traffic into a single global Web UI row. This
makes the admin Sources tab properly per-user attributable:

- KeyTotal gains UserID + UserName, populated from the snapshot the
  usage middleware already records. The by_key roll-up now groups by
  (api_key_id, api_key_name, user_id, user_name).
- New SourceTotals.ByUserSource roll-up groups (source, user_id,
  user_name) for sources without a key identity (web, legacy). Only
  populated on the admin path (includeLegacy=true); the non-admin
  endpoint stays unchanged for backwards compatibility.
- SourcesTable accepts showUserColumn={isAdmin}; admin view renders
  a User column, makes the search match user name/id, and expands
  Web UI / legacy pseudo-rows from the global aggregate to one row
  per user using by_user_source.

Refs: #9862

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-21 23:23:06 +02:00
LocalAI [bot]
a39e025d64 fix(nodes): make per-node backend install async via gallery job queue (#9928)
* feat(galleryop): add TargetNodeID to ManagementOp for single-node installs

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(galleryop): add NodeScopedKey helpers for per-node opcache rows

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(galleryop): use strings.Cut for NodeScopedKey parsing, reject empty nodeID

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(nodes): scope DistributedBackendManager.InstallBackend to single node via TargetNodeID

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(http): make /api/nodes/:id/backends/install async via gallery service job queue

The handler previously called unloader.InstallBackend synchronously and
blocked the browser for up to 3 minutes waiting on the NATS reply. It now
enqueues a TargetNodeID-scoped ManagementOp on BackendGalleryChannel and
returns HTTP 202 + jobID immediately, matching /api/backends/install/:id.

The opcache key is built via NodeScopedKey(nodeID, backend) so concurrent
installs of the same backend across different nodes do not stomp each
other. galleryService/opcache/appConfig are threaded through
RegisterNodeAdminRoutes for this.

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(http): log malformed backend_galleries override and stop test drain goroutine

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(api): expose nodeID for node-scoped backend ops in /api/operations

Node-scoped backend installs land in opcache under "node:<nodeID>:<backend>"
keys. Without splitting that prefix back out, the operations panel renders
the full key as the display name and has no structured way to label which
worker an install is targeting. Detect the prefix, surface nodeID as its own
response field, and reduce the display name back to the bare backend slug.
Bare (non-scoped) ops are left untouched so legacy installs do not gain a
misleading empty nodeID.

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(react-ui): poll job status for node-targeted backend installs

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(react-ui): make NodeInstallPicker state updates pure and surface cancellations as errors

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(react-ui): clarify async semantics in handleInstallOnTarget

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(http): use statusUrl casing for node install response to match codebase precedent

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-21 22:25:53 +02:00
Ettore Di Giacinto
05e8e1e9f4 ci(images): publish chronologically-orderable master-<epoch>-<sha> tags
The existing master push pipeline produces `master` (rolling) and
`sha-<short>` tags. Neither is orderable by build time, so downstream
GitOps that want to auto-bump to the newest master build (e.g. Flux
ImagePolicy) can't pick the latest from the tag list — alphabetical
sort over hex shas is effectively random, and the rolling `master`
tag can't be referenced as an immutable bump target.

Add a third tag of the form `master-<epoch>-<sha>` (Unix epoch in
seconds + short sha), gated on default-branch pushes via metadata-
action's `is_default_branch` predicate. The sha is retained for
traceability; the epoch makes the tags numerically orderable, so a
Flux ImagePolicy like

  filterTags:
    pattern: '^master-(?P<ts>[0-9]+)-[a-f0-9]+$'
    extract: '$ts'
  policy:
    numerical:
      order: asc

will reliably bump to the newest master build.

Applied to both image_build.yml (OCI labels stay consistent) and
image_merge.yml (the actual tag publisher via buildx imagetools).
2026-05-21 17:18:30 +00:00
Rin
a7f6cc8956 [utils] Fail immediately on extraction errors (#9926)
utils: fail immediately on extraction errors

Setting ContinueOnError to false ensures that ExtractArchive does not
leave the model or backend directory in an inconsistent state if a
partial failure occurs. This improves robustness against malformed
archives or unexpected I/O issues during installation.

Signed-off-by: RinZ27 <222222878+RinZ27@users.noreply.github.com>
2026-05-21 19:00:33 +02:00
LocalAI [bot]
f15b9178ec feat(usage): track and visualise usage per API key (#9920)
* feat(usage): add Source, APIKeyID, APIKeyName columns to UsageRecord

Adds three additive columns plus UsageSource* constants. The columns
are auto-migrated by InitDB. APIKeyID is a nullable foreign reference
to UserAPIKey.ID; APIKeyName is snapshotted on each row so revoked
keys keep showing their name in history.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(usage): backfill Source on pre-feature usage rows

InitDB now classifies any pre-existing usage_record with an empty
source: 'legacy-api-key' user -> legacy, everything else -> web.
The backfill is idempotent (only touches NULL/empty rows).

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(usage): add GetUserUsageBySource aggregator

Groups by (bucket, source, api_key_id, api_key_name). Filters out
legacy by default. Returns both per-bucket detail and roll-ups
(by_source, by_key sorted desc and capped at 200, grand_total).

The MAX(created_at) projection is iterated via Rows().Scan into a
string column and parsed manually because the SQLite driver surfaces
the aggregated timestamp as a string, which database/sql refuses to
scan directly into time.Time. Postgres returns a real timestamp; the
same string path handles its RFC3339 form too.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(usage): log Rows() errors and assert LastUsed in tests

Adds rows.Err() and Rows() open-failure logging in
computeSourceTotals so silent data drops surface in logs. Logs on
parseLastUsedString format misses for the same reason. Strengthens
the snapshot-survival test to assert LastUsed is a recent timestamp,
locking the SQLite time-string parser behaviour.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(usage): add admin GetAllUsageBySource with filters and truncation

Optional user_id and api_key_id filters (composed with AND). Legacy
bucket is included for admin callers. truncated=true when more than
200 distinct keys would be in the by_key roll-up.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(auth): plumb auth_source and auth_apikey through Echo context

tryAuthenticate now sets auth_source on every successful branch
(web for session/Bearer-session, apikey for Bearer-key/x-api-key/
token-cookie, legacy for legacy env key match). For named-key
branches it also stores the resolved *UserAPIKey under auth_apikey
so downstream middlewares can snapshot id+name without re-validating.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(auth): expand tryAuthenticate godoc and cover Bearer-session branch

Documents all three context-keys side effects (auth_source,
auth_apikey, _auth_session) plus the split of responsibilities with
the parent Middleware. Adds a test for the Bearer-as-session-token
classification so future regressions there fail loudly.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(usage): UsageMiddleware records source + snapshots key name

Reads auth_source and auth_apikey from the Echo context (set by
auth.Middleware in the previous task). Snapshots UserAPIKey.ID and
Name onto each row so revoked keys remain readable in history.
Falls back to source=web when no auth_source is set (auth disabled
or unrecognised path).

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(usage): add /api/auth/usage/sources and admin variant

Self endpoint filters legacy server-side; admin endpoint includes
legacy and accepts user_id + api_key_id filters. Response includes
buckets, totals.{by_source, by_key, grand_total}, and a truncated
flag set when the per-key roll-up was capped at 200.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(routes): mark test mirror handlers as keep-in-sync with production

The newTestAuthApp helper duplicates production route handlers
inline because it cannot use RegisterAuthRoutes (which requires a
*application.Application). Naming the source path on each mirror
makes the drift contract explicit for future maintainers.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): add usageApi.getMySources/getAdminSources + i18n strings

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): add Sources tab skeleton with data fetch

Adds Usage page tab that fetches /api/auth/usage/sources (or the
admin variant). Renders raw totals plus a placeholder key list;
real visualisations land in subsequent commits. Restructures the
existing tab button block so Models and Sources are visible to
non-admins (Users remains admin-only).

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): source mix ribbon + searchable/sortable sources table

Replaces the SourcesTab placeholder rendering with two reusable
components: SourceMixRibbon (one segmented bar per source class)
and SourcesTable (search + sort + revoked-key dim). Pulls the
current API key list to detect revoked keys.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ui): skip revoked-key detection until the key list is known

existingKeyIds defaulted to an empty Set, which made every live
api_key row render as (revoked) during the brief window before
apiKeysApi.list() resolved, and permanently after a fetch failure.
Use null as the unknown state and suppress the revoked badge until
the parent provides a real Set.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): top-N stacked time chart and drill-in chip for Sources tab

Top 7 sources by total tokens get distinct colours; the rest roll up
into 'Other'. Clicking a row in the SourcesTable dims everything
except that series in the chart; the chip is the canonical clear.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(usage): document per-API-key Sources tab and endpoints

Extends features/authentication.md Usage Tracking section with:
- A 'Sources' tab description and source-class taxonomy
- Endpoint documentation for /api/auth/usage/sources and the
  admin variant
- Response shape example with by_source / by_key / grand_total
- Migration note about pre-feature row backfill

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(usage): silence errcheck on deferred rows.Close

CI errcheck flagged the bare 'defer rows.Close()' in
computeSourceTotals. Wrap in a closure that discards the close
error explicitly; an error here is non-actionable since we have
already drained the rows and logged any iteration failure.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(usage): bound batcher intake and add Shutdown/FlushNow hooks

The pre-existing usage batcher had no cap on its add() path; the
usageMaxPending=5000 constant only guarded the re-queue path after
a failed write, leaving memory growth unbounded if the DB fell
behind. This commit:

- Adds the cap to add() so saturation drops new records (rate-limited
  warn at 1/1024) instead of growing unbounded.
- Raises usageMaxPending to 50000 to absorb realistic inference bursts.
- Replaces the package-level batcher global with a mutex-guarded pair
  plus a currentBatcher() accessor so Init / Shutdown cycles are
  race-free.
- Adds ShutdownUsageRecorder() for graceful drain on process exit
  (not yet wired into app shutdown, just published).
- Adds FlushNow() for deterministic tests; the middleware suite no
  longer needs 6s sleeps per spec and now runs in ~50ms instead of 18s.
- Re-queue on failed flush is now cap-aware: prepends as much of the
  failed batch as fits alongside concurrent arrivals, instead of
  dropping the whole batch when full.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(usage): drain usage batcher on graceful shutdown

Registers ShutdownUsageRecorder with the existing
signals.RegisterGracefulTerminationHandler so SIGINT/SIGTERM
synchronously flushes any in-memory usage records before the
process exits. Without this, up to one flush interval (5s) of
recorded usage was lost when LocalAI restarted.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-21 16:34:02 +02:00
LocalAI [bot]
959de86761 feat(llama-cpp): make server-side prompt cache work by default (#9925)
Aligns LocalAI's llama-cpp gRPC backend with upstream's auto-on prompt
cache path so repeated system prompts (agents, OpenAI/Anthropic-compatible
CLIs, coding assistants) skip prefill on subsequent calls without any
YAML changes. Reported in #9921.

Upstream's server enables `kv_unified=true` (and bumps `n_parallel` to 4)
when slot count is auto, which unlocks `cache_idle_slots`. LocalAI
hardcodes `n_parallel=1` and so far also hardcoded `kv_unified=false`,
which silently force-disables idle-slot saving at server init. The host
prompt cache was allocated but never written across requests.

Changes in backend/cpp/llama-cpp/grpc-server.cpp:
- params.kv_unified: false -> true (single-slot path now benefits from
  the prompt cache; users can opt out with `kv_unified:false`)
- params.n_ctx_checkpoints: 8 -> 32 (match upstream default)
- params.cache_idle_slots = true initialized explicitly (upstream default)
- params.checkpoint_every_nt = 8192 initialized explicitly (upstream default)
- New option parsers: cache_idle_slots / idle_slots_cache,
  checkpoint_every_nt / checkpoint_every_n_tokens

Docs:
- features/text-generation.md: fix misleading `cache_ram` description
  (it's the host-side prompt cache, not the KV cache), document the
  kv_unified + cache_ram + cache_idle_slots interaction, add rows for
  the two newly-exposed options, and add a worked example for the
  agent/CLI workload from the issue.
- advanced/model-configuration.md: mark the legacy `prompt_cache_path`
  / `prompt_cache_all` / `prompt_cache_ro` YAML fields as unused by the
  llama-cpp gRPC backend (they target upstream's CLI completion tool
  and are not consumed by grpc-server.cpp) and point readers at the
  new prompt-cache explainer.

Closes #9921

Assisted-by: claude:opus-4.7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-21 16:31:48 +02:00
LocalAI [bot]
4c234abc2c refactor(agents): bump skillserver, drop redundant Name from list_skills output (#9916)
refactor(agents): bump skillserver, drop redundant Name from list_skills/search_skills

skillserver's list_skills MCP tool used to ship every entry with name=""
(field was commented out), while search_skills populated it - two tools
with inconsistent shape for the same data. skill.Name and skill.ID are
populated from the same source string anyway (the directory name), so
returning both was pure duplication.

Bumps github.com/mudler/skillserver to a7317cb, which drops the Name
field from both SkillInfo and SearchResult and leaves ID as the single
canonical identifier (already what read_skill consumes).

Adds core/services/skills/skills_mcp_test.go, a regression that drives
the LocalAI FilesystemManager through an in-process MCP session and
asserts a newly-created skill is visible by ID on the still-open session.

This is a cleanup, not the root cause of #9868 - the reporter likely
sees something deeper than a cosmetic JSON shape issue.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-21 14:45:53 +02:00
Richard Palethorpe
c68818a62e fix(llama-cpp): terminate tensor_buft_overrides with sentinel (#9919)
llama.cpp's model loader asserts back().pattern == nullptr on
params.tensor_buft_overrides (and on params.kv_overrides.back().key[0]
== 0) before binding them into llama_model_params. PR #8560 attempted
to satisfy llama_params_fit's placeholder requirement by pre-filling
params.tensor_buft_overrides up to llama_max_tensor_buft_overrides()
*before* the option-parse loop. Any subsequent push_back from
override_tensor / draft_cpu_moe / draft_n_cpu_moe / draft_override_tensor
then appended real entries after the placeholders, leaving back() with
a real pattern and tripping the assert. The draft override vector
likewise had no terminator at all.

Mirror upstream common/arg.cpp:645-658 instead: real entries are
pushed during option parsing, and after parsing we pad the main vector
up to ntbo (placeholders land at the end, so back() is always nullptr)
and append a single {nullptr, nullptr} to the draft vector when it is
non-empty. The existing kv_overrides terminator block already matches
upstream and stays.

Verified against ggml-org/llama.cpp@5cbaa5e: only tensor_buft_overrides
(main + draft) and kv_overrides are sentinel-terminated common_params
fields; everything else is size-driven std::vector.

Assisted-by: claude-code:claude-opus-4-7

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-21 12:55:06 +02:00
LocalAI [bot]
11d5bd0cc3 fix(react-ui/chat): stop wiping selection on every /api/operations poll (#9904) (#9917)
useOperations() was calling setOperations() with a fresh array on every
1s poll, even when the payload was identical. In React 19 the DOM diff
no longer short-circuits dangerouslySetInnerHTML on equal __html, so the
forced Chat re-render re-assigned innerHTML on every assistant message
once per second — wiping any text the user had selected.

Skip the state update when the serialised operations payload is
unchanged, and switch loading/error to functional setters so they also
short-circuit at the source.

Also fixes the chat copy button on plain HTTP: navigator.clipboard is
undefined in non-secure contexts (a common LXC+Docker deployment), but
the previous code called it unconditionally and showed a success toast
regardless. Routed Chat, AgentChat and CanvasPanel through a new
copyToClipboard() helper that uses navigator.clipboard when available
and falls back to a hidden-textarea + execCommand('copy') trick that
browsers still honour outside secure contexts. The fallback preserves
the user's existing selection.

Regression coverage in e2e/chat-polling-selection.spec.js: a
MutationObserver counts mutations on the assistant content node across
3s of polling (must be 0); the copy test stubs out navigator.clipboard
and asserts that execCommand('copy') is invoked.


Assisted-by: claude-opus-4-7-1m

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-21 12:17:51 +02:00
LocalAI [bot]
12e056e96d chore: ⬆️ Update ggml-org/llama.cpp to ad277572619fcfb6ddd38f4c6437283a4b2b8636 (#9915)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-21 09:07:31 +02:00
LocalAI [bot]
308aa8908a chore: ⬆️ Update ace-step/acestep.cpp to ed53caf164e4492a5620b2e3f2264629cf66da24 (#9913)
⬆️ Update ace-step/acestep.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-21 00:15:57 +02:00
LocalAI [bot]
b2d68a53a2 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 11a1fea9e291f12ce2c803a9d7812c30ca806bcf (#9914)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-20 22:04:06 +00:00
LocalAI [bot]
e3706c0512 chore(model-gallery): ⬆️ update checksum (#9910)
⬆️ Checksum updates in gallery/index.yaml

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-20 23:38:45 +02:00
LocalAI [bot]
1ffd82a050 chore: ⬆️ Update antirez/ds4 to 2606543be7a8c125a32cee37f5d1d85dc78f2fcf (#9909)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-20 21:22:26 +00:00
LocalAI [bot]
f515168dbe chore(acestep-cpp): bump pin to ed53caf and adapt wrapper to new API (#9908)
The new ace-step.cpp revision moves backend initialization inside each
`*_load` call and drops the separate `DiTGGMLConfig` argument from
`dit_ggml_load` (config now lives in `DiTGGML::cfg`, populated from GGUF
metadata at load time). Drop the now-removed `*_init_backend` calls and
replace `g_dit_cfg` accesses with `g_dit.cfg`.


Assisted-by: Claude:claude-opus-4-7 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-20 21:05:32 +00:00
LocalAI [bot]
ef6ca34513 chore: ⬆️ Update leejet/stable-diffusion.cpp to 5b0267e941cade15bd80089d89838795d9f4baa6 (#9907)
Adapt the C++ wrapper to the new `generate_video()` signature: upstream now
returns `bool` and writes frames/audio via out-parameters (`sd_image_t**`,
`sd_audio_t**`). Also set `p->fps` on the params struct (new upstream field)
and free the returned audio handle on both the success and error paths.


Assisted-by: claude-code:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-20 20:53:19 +00:00
dependabot[bot]
9413c3767f chore(deps): update transformers requirement from >=5.8.0 to >=5.8.1 in /backend/python/transformers (#9883)
chore(deps): update transformers requirement

Updates the requirements on [transformers](https://github.com/huggingface/transformers) to permit the latest version.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](https://github.com/huggingface/transformers/compare/v5.8.0...v5.8.1)

---
updated-dependencies:
- dependency-name: transformers
  dependency-version: 5.8.1
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-20 22:16:02 +02:00
dependabot[bot]
3bf3cce232 chore(deps): bump sentence-transformers from 5.4.0 to 5.5.0 in /backend/python/transformers (#9888)
chore(deps): bump sentence-transformers in /backend/python/transformers

Bumps [sentence-transformers](https://github.com/huggingface/sentence-transformers) from 5.4.0 to 5.5.0.
- [Release notes](https://github.com/huggingface/sentence-transformers/releases)
- [Commits](https://github.com/huggingface/sentence-transformers/compare/v5.4.0...v5.5.0)

---
updated-dependencies:
- dependency-name: sentence-transformers
  dependency-version: 5.5.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-20 22:13:39 +02:00
LocalAI [bot]
06f8159035 chore: ⬆️ Update ggml-org/llama.cpp to 67ace021da905e27ecbdf1176b0eef578a5288c0 (#9897)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-20 22:05:58 +02:00
LocalAI [bot]
f6a73f54fa feat(swagger): update swagger (#9872)
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-20 22:05:35 +02:00
LocalAI [bot]
24e04d8e81 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 77413bc900f9a2bfd8a5407f184427bcc0825f6c (#9899)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-20 01:02:53 +02:00
LocalAI [bot]
b9a49449ae chore: ⬆️ Update ggml-org/whisper.cpp to afa2ea544fb4b0448916b4a31ecd33c8685bd482 (#9898)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-20 01:02:25 +02:00
LocalAI [bot]
1879e11042 chore: ⬆️ Update antirez/ds4 to 599e49d253971451f710cb8323344e789906ed6c (#9900)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-20 01:01:45 +02:00
LocalAI [bot]
403d391316 chore(model-gallery): ⬆️ update checksum (#9901)
⬆️ Checksum updates in gallery/index.yaml

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-20 01:01:20 +02:00
Daniel Liljeberg
fc3980dadd fix: inject text-file content into chat completions messages (#9896)
Non-image/non-audio file attachments (txt, md, csv, json) were being
  stored in the 'files' metadata field but never added to the message
  content array sent to /v1/chat/completions. Images and audio correctly
  received content blocks; files did not.

  Fix: push a text content block into messageContent when textContent is
  present, matching the pattern used for image_url and audio_url.

  Also fixes Home.jsx addFiles which never called file.text() at all,
  meaning files attached on the home screen had empty textContent even
  before reaching useChat.js.

  Note: PDF files use file.text() which returns raw bytes rather than
  parsed text. Proper PDF support would require PDF.js or server-side
  extraction and is not part of this fix.

Signed-off-by: Daniel Liljeberg <damien_@hotmail.com>
2026-05-20 01:00:32 +02:00
Richard Palethorpe
2009544b44 fix(nix): correct flake src path and add dev shell (#9894)
The flake set `src = ./sources;` referencing a non-existent subdirectory,
so `nix build` and `nix develop` both failed evaluation. Point `src` at
the repo root and refresh `vendorHash` accordingly.

Add `devShells.default` with the Go toolchain, protobuf generators,
Node.js/bun for the React UI (`make react-ui`), and the linters used by
`make lint` (golangci-lint, gofumpt, goimports, staticcheck).

Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-19 19:28:30 +02:00
dependabot[bot]
e859345b12 chore(deps): bump github.com/alecthomas/kong from 1.14.0 to 1.15.0 (#9881)
Bumps [github.com/alecthomas/kong](https://github.com/alecthomas/kong) from 1.14.0 to 1.15.0.
- [Commits](https://github.com/alecthomas/kong/compare/v1.14.0...v1.15.0)

---
updated-dependencies:
- dependency-name: github.com/alecthomas/kong
  dependency-version: 1.15.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-19 08:07:07 +02:00
dependabot[bot]
f30712f8e8 chore(deps): bump github.com/aws/aws-sdk-go-v2 from 1.41.6 to 1.41.7 (#9892)
Bumps [github.com/aws/aws-sdk-go-v2](https://github.com/aws/aws-sdk-go-v2) from 1.41.6 to 1.41.7.
- [Release notes](https://github.com/aws/aws-sdk-go-v2/releases)
- [Commits](https://github.com/aws/aws-sdk-go-v2/compare/v1.41.6...v1.41.7)

---
updated-dependencies:
- dependency-name: github.com/aws/aws-sdk-go-v2
  dependency-version: 1.41.7
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-19 08:06:50 +02:00
dependabot[bot]
a19c77c5f8 chore(deps): bump github.com/onsi/ginkgo/v2 from 2.28.2 to 2.29.0 (#9882)
Bumps [github.com/onsi/ginkgo/v2](https://github.com/onsi/ginkgo) from 2.28.2 to 2.29.0.
- [Release notes](https://github.com/onsi/ginkgo/releases)
- [Changelog](https://github.com/onsi/ginkgo/blob/master/CHANGELOG.md)
- [Commits](https://github.com/onsi/ginkgo/compare/v2.28.2...v2.29.0)

---
updated-dependencies:
- dependency-name: github.com/onsi/ginkgo/v2
  dependency-version: 2.29.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-19 08:06:34 +02:00
LocalAI [bot]
4b02d23c0c chore: ⬆️ Update ggml-org/llama.cpp to 5cbaa5e69e09bde3334cd8c355570553a0dca027 (#9876)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-19 08:06:16 +02:00
LocalAI [bot]
21140e96b2 chore: ⬆️ Update ggml-org/whisper.cpp to 47b9eb37a33c5031a1b667ace64477330b9f36c1 (#9877)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-19 08:05:56 +02:00
dependabot[bot]
fc803e8d48 chore(deps): bump golang.org/x/crypto from 0.50.0 to 0.51.0 (#9886)
Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.50.0 to 0.51.0.
- [Commits](https://github.com/golang/crypto/compare/v0.50.0...v0.51.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-version: 0.51.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-19 08:04:15 +02:00
LocalAI [bot]
ca51606bfe chore: ⬆️ Update ikawrakow/ik_llama.cpp to 40aae0b6d86d50c0ee7011b3ce59a233203e430a (#9875)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-19 08:01:41 +02:00
Azteczek
cb502de309 feat: add flake.nix for dockerless setup (#9851)
* Add flake.nix

Signed-off-by: Azteczek <243776410+Azteczek@users.noreply.github.com>

* Add flake.lock

Signed-off-by: Azteczek <243776410+Azteczek@users.noreply.github.com>

---------

Signed-off-by: Azteczek <243776410+Azteczek@users.noreply.github.com>
2026-05-18 15:23:10 +01:00
Richard Palethorpe
5d0b549049 feat(gallery): verify backend OCI images with keyless cosign (#9823)
* feat(gallery): verify backend OCI images with keyless cosign

Close a trust gap where a registry compromise or MITM could silently
replace a backend image: the gallery YAML tells LocalAI which image to
pull, but until now nothing verified the bytes came from our CI.

Consumer (pkg/oci/cosignverify):
- New package using sigstore-go to verify keyless-cosign signatures.
- OCI 1.1 referrers API + new bundle format (no legacy :tag.sig).
- Policy fields: Issuer / IssuerRegex / Identity / IdentityRegex /
  NotBefore. NotBefore is the revocation lever — keyless Fulcio certs
  are ephemeral so revocation is policy-side; advancing not_before in
  the gallery YAML invalidates every signature predating the cutoff.
- TUF trusted root cached process-wide so N backends from one gallery
  do 1 fetch, not N.

Plumbing:
- pkg/downloader: ImageVerifier interface + WithImageVerifier option
  threaded through DownloadFileWithContext. Verification runs between
  oci.GetImage and oci.ExtractOCIImage, with digest pinning via
  pinnedImageRef to close the TOCTOU window. Skips the verifier's HEAD
  when the ref is already digest-pinned.
- core/config: Gallery.Verification YAML block.
- core/gallery: backendDownloadOptions builds the verifier from the
  policy; applied on initial URI, mirrors, and tag fallbacks.
- core/gallery/upgrade: the upgrade path now routes through the same
  options builder. A regression Ginkgo spec pins this contract —
  without it, UpgradeBackend silently bypassed verification.
- core/cli: --require-backend-integrity (LOCALAI_REQUIRE_BACKEND_INTEGRITY)
  escalates missing policy / empty SHA256 from warn to hard-fail.

Producer (.github/workflows/backend_merge.yml):
- id-token: write at job scope (PR-fork-safe via existing event gate).
- sigstore/cosign-installer@v3 pinned to v2.4.1.
- After each docker buildx imagetools create, resolve the manifest
  list digest and run cosign sign --recursive --new-bundle-format
  --registry-referrers-mode=oci-1-1 against repo@digest. --recursive
  signs the index and every per-arch entry, matching how the consumer
  resolves a tag to a platform-specific manifest before verifying.

Rollout: backend/index.yaml has no `verification:` block yet, so this
PR is backward-compatible — installs proceed with a warning until the
gallery is populated. Strict mode is opt-in.

Assisted-by: claude-code:claude-opus-4-7 [Bash] [Edit] [Read] [Write] [WebSearch] [WebFetch]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* refactor(gallery): plumb RequireBackendIntegrity through config instead of env

The previous implementation re-exported the --require-backend-integrity
CLI flag into LOCALAI_REQUIRE_BACKEND_INTEGRITY via os.Setenv, then
re-read it in core/gallery via os.Getenv. This leaked process state
into the gallery package and made the flag impossible to override
per-call or test without touching the env.

Add RequireBackendIntegrity to ApplicationConfig (with a matching
WithRequireBackendIntegrity AppOption) and thread the bool through
every install/upgrade path: InstallBackend, InstallBackendFromGallery,
UpgradeBackend, InstallModelFromGallery, InstallExternalBackend,
ApplyGalleryFromString/File, startup.InstallModels. Worker subcommands
gain the same env-bound flag on WorkerFlags so distributed-worker
installs honor it consistently with the worker daemon path.

Add a forbidigo lint rule against os.Getenv / os.LookupEnv / os.Environ
to keep the env-leak pattern from creeping back. Existing offenders
(p2p, config loaders, etc.) are baseline-grandfathered by the existing
new-from-merge-base: origin/master setting; targeted path exclusions
cover the legitimate cases — kong CLI entry points, backend
subprocesses, system capability probes, gRPC AUTH_TOKEN inheritance,
test gating env vars.

Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-18 08:02:20 +02:00
LocalAI [bot]
11cff1b309 chore: ⬆️ Update ggml-org/llama.cpp to 87589042cac2c390cec8d68fb2fad64e0a2a252a (#9855)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-18 08:01:30 +02:00
LocalAI [bot]
4ca3d2cdc0 docs: ⬆️ update docs version mudler/LocalAI (#9863)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-17 23:20:16 +02:00
LocalAI [bot]
3cba35ed32 chore: ⬆️ Update antirez/ds4 to c9dd9499bfa57c1bbfbb4446eff963330ab5329b (#9864)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-17 23:19:58 +02:00
LocalAI [bot]
265ae35231 chore: ⬆️ Update ikawrakow/ik_llama.cpp to c35189d83c91aad780aba62b89f2830cb2916223 (#9866)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-17 23:19:43 +02:00
LocalAI [bot]
6a48157a80 chore: ⬆️ Update leejet/stable-diffusion.cpp to bd17f53b7386fb5f60e8587b75e73c4b2fed3426 (#9854)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-16 23:12:05 +02:00
LocalAI [bot]
41c838b2df chore: ⬆️ Update ikawrakow/ik_llama.cpp to 3e573cfea6e0a332eff822ffbdb1dd3b112e9051 (#9856)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-16 22:44:08 +02:00
LocalAI [bot]
21e793ad2a chore: ⬆️ Update antirez/ds4 to ef0a4905d05263df8e63689f2dd1efac618a752c (#9857)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-16 22:43:46 +02:00
LocalAI [bot]
7c190bb4b9 docs: ⬆️ update docs version mudler/LocalAI (#9853)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-16 22:43:06 +02:00
LocalAI [bot]
d77a9137d8 feat(llama-cpp): bump to MTP-merge SHA and automatically set MTP defaults (#9852)
* feat(llama-cpp): bump to MTP-merge SHA and document draft-mtp spec type

Update LLAMA_VERSION to 0253fb21 (post ggml-org/llama.cpp#22673 merge,
2026-05-16) to pick up Multi-Token Prediction support.

No grpc-server.cpp changes are required: the existing `spec_type` option
delegates to upstream's `common_speculative_types_from_names()`, which
already accepts the new `draft-mtp` name. The `n_rs_seq` cparam needed
by MTP is auto-derived inside `common_context_params_to_llama` from
`params.speculative.need_n_rs_seq()`, and when no `draft_model` is set
the upstream server builds the MTP context off the target model itself.

Docs: extend the speculative-decoding section of the model-configuration
guide with the new type, both load paths (MTP head embedded in the main
GGUF vs. separate `mtp-*.gguf` sibling), the PR's recommended
`spec_n_max:2-3`, and the chained `draft-mtp,ngram-mod` recipe. Also
notes that the upstream `-hf` auto-discovery of `mtp-*.gguf` siblings is
not wired through LocalAI's gRPC layer.

Agent guide: short note explaining that new upstream spec types are
picked up automatically and that MTP needs no gRPC plumbing.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(llama-cpp): auto-detect MTP heads and enable draft-mtp on import + load

Detect upstream's `<arch>.nextn_predict_layers` GGUF metadata key (set by
`convert_hf_to_gguf.py` for Qwen3.5/3.6 family models and similar) and,
when present and the user has not configured a `spec_type` explicitly,
auto-append the upstream-recommended speculative-decoding tuple:

  - spec_type:draft-mtp
  - spec_n_max:6
  - spec_p_min:0.75

The 0.75 p_min is pinned defensively because upstream marks the current
default with a "change to 0.0f" TODO; locking it here keeps acceptance
thresholds stable across future llama.cpp bumps.

Detection runs in two places:

  - The model importer (`POST /models/import-uri`, the `/import-model`
    UI) range-fetches the GGUF header for HuggingFace / direct-URL
    imports via `gguf.ParseGGUFFileRemote`, with a 30s timeout and
    non-fatal error handling. OCI/Ollama URIs are skipped because the
    artifact is not directly streamable; the load-time hook covers them
    once the file is on disk.
  - The llama-cpp load-time hook (`guessGGUFFromFile`) reads the local
    header on every model start and appends the same options if
    `spec_type` is not already set.

Both paths share `ApplyMTPDefaults` and respect an explicit user-set
`spec_type:` / `speculative_type:` so YAML overrides win. Ginkgo
specs cover the append, preserve-user-choice, legacy alias, and nil
safety paths.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(importer): resolve huggingface:// URIs before MTP header probe

`gguf.ParseGGUFFileRemote` only speaks HTTP(S), but the importer was
handing it the raw `huggingface://...` URI directly (and similarly for
any other custom downloader scheme). Live-test against
`huggingface://ggml-org/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-MTP-Q8_0.gguf`
exposed this: the probe failed with `unsupported protocol scheme
"huggingface"`, was caught by the non-fatal error path, and the MTP
options were silently never applied to the generated YAML.

Route every candidate URI through `downloader.URI.ResolveURL()` and
require the resolved form to be HTTP(S). After the fix the probe
successfully reads `<arch>.nextn_predict_layers=1` from the real HF
GGUF and the emitted ConfigFile carries spec_type:draft-mtp,
spec_n_max:6, spec_p_min:0.75 as intended.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-16 22:42:48 +02:00
LocalAI [bot]
661a0c3b9d fix(ollama): accept float-encoded integer options (fixes #9837) (#9849)
fix(ollama): accept float-encoded integer options (num_ctx, top_k, ...)

Home Assistant's Ollama integration encodes integer options as JSON
floats (e.g. `"num_ctx": 8192.0`). Stdlib `json.Unmarshal` refuses to
decode a number with fractional notation into an `int` field, so the
entire request was rejected with HTTP 400 before reaching the backend:

  Unmarshal type error: expected=int, got=number 8192.0,
  field=options.num_ctx

Add a custom `UnmarshalJSON` on `OllamaOptions` that routes the int
fields (`top_k`, `num_predict`, `seed`, `repeat_last_n`, `num_ctx`)
through `*json.Number`, then converts via `Int64()` with a `Float64()`
fallback. Public field types are unchanged, so endpoint code is
untouched. Float fields and `stop` continue to parse via the default
path.

Fixes #9837

Assisted-by: Claude Code:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-16 18:38:19 +02:00
LocalAI [bot]
00b8989886 chore: ⬆️ Update ggml-org/llama.cpp to 1348f67c58f561808136e8a152a9eddec168f221 (#9842)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-16 08:41:09 +02:00
LocalAI [bot]
43e0d397ca chore: ⬆️ Update ggml-org/whisper.cpp to 968eebe77225d25e57a3f981da7c696310f0e881 (#9843)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-16 00:30:04 +02:00
LocalAI [bot]
a1a7a219ed chore: ⬆️ Update antirez/ds4 to 950e8e6474a1c9fabe04e669d607606a7ef8824f (#9844)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 23:46:29 +02:00
LocalAI [bot]
3937ec6527 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 5cc0d86c760e9858e4bed4418400bb39dbe025f2 (#9845)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 23:45:54 +02:00
LocalAI [bot]
1355b55794 chore: ⬆️ Update vllm-project/vllm cu130 wheel to 0.21.0 (#9846)
⬆️ Update vllm-project/vllm cu130 wheel

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 23:45:41 +02:00
Richard Palethorpe
5a2626d465 fix(deps): bump gomarkdown/markdown for GHSA-77fj-vx54-gvh7 (#9841)
Out-of-bounds read in SmartypantsRenderer.smartLeftAngle (CWE-125,
CVSS 7.5). Reachable transitively via LocalAGI's Email connector,
which renders inbound HTML email replies using html.CommonFlags
(includes Smartypants). An unmatched `<` in the inbound body could
panic the agent service.

Bump to v0.0.0-20260411013819-759bbc3e3207 (contains the fix). The
klauspost/compress entry loses its `// indirect` tag because
go mod tidy noticed pkg/utils/untar.go imports it directly.

Assisted-by: Claude:claude-opus-4-7 [Claude-Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-15 21:48:59 +02:00
LocalAI [bot]
a39591f144 realtime: honor output_modalities to skip TTS in text-only mode (#9838)
* realtime: honor output_modalities to skip TTS in text-only mode

The emulated realtime pipeline previously ignored the OpenAI Realtime spec
field output_modalities and always synthesized TTS. Add resolveOutputModalities
+ modalitiesContainAudio helpers and gate the TTS / ResponseOutputAudio*
emission so a client requesting ["text"] gets only ResponseOutputText* events.

This lets thin clients (e.g. thing5-poc) cache TTS on the client side while
still using the realtime WS for VAD + STT + LLM + tool-call parsing.

Assisted-by: Claude:claude-opus-4-7

* realtime: plumb response-level output_modalities and echo on session

Follow-up to the previous commit:
- Resolve response.create's output_modalities at the gate so a per-response
  override of an audio session is honored (the test asserted this contract
  but the production call site was passing nil).
- Mirror OutputModalities in the RealtimeSession echo so session.update
  round-trips the client-supplied value, matching MaxOutputTokens's pattern.

Assisted-by: Claude:claude-opus-4-7

* realtime: silence errcheck on deferred os.Remove of TTS file

CI's errcheck flagged the pre-existing `defer os.Remove(audioFilePath)`
inside the audio-emission block (now wrapped by the modality gate). Wrap
the call in a closure that explicitly discards the error — the canonical
Go pattern for "I want to defer a cleanup whose error I genuinely don't
care about."

Assisted-by: Claude:claude-opus-4-7 golangci-lint

---------

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-15 12:39:47 +02:00
massy_o
8c785dbe4a Validate archive member paths before extraction (#9820)
Signed-off-by: massy-o <telitos000@gmail.com>
2026-05-15 11:12:13 +02:00
LocalAI [bot]
4abf5befbb chore: ⬆️ Update ggml-org/llama.cpp to 834a243664114487f99520370a7a7b00fc7a486f (#9826)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 10:29:22 +02:00
LocalAI [bot]
195b910260 chore: ⬆️ Update leejet/stable-diffusion.cpp to 0b8296915c4094090cff6bd2e09a5e98288c3c7d (#9827)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 10:19:52 +02:00
LocalAI [bot]
ba21bf667c docs: ⬆️ update docs version mudler/LocalAI (#9825)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 10:19:34 +02:00
LocalAI [bot]
7bd1693ad0 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 0fcffdb64d21e57f0778f342415754156e01adfa (#9828)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 10:08:46 +02:00
LocalAI [bot]
b5ac3a7373 chore: ⬆️ Update ggml-org/whisper.cpp to 46ca43d6399fdeada1b49fb2126ba373bd9ebc38 (#9829)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 10:08:24 +02:00
LocalAI [bot]
53de474ef5 chore: ⬆️ Update antirez/ds4 to 04b6fda2be395094cbf2d20d921e7a705a4166ef (#9830)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 10:08:09 +02:00
LocalAI [bot]
c33d36b870 fix(ollama): guard nil filter in galleryop.ListModels (#9817) (#9836)
The Ollama /api/tags handler passes a nil filter to galleryop.ListModels.
When ModelsPath contains any non-skipped loose file the function then
calls filter(name, nil) and panics, which Echo surfaces to clients as
"Server disconnected without sending a response" - the exact failure
Home Assistant's Ollama integration reports against LocalAI.

Mirror the nil guard already present in
ModelConfigLoader.GetModelConfigsByFilter so every caller is safe, and
add a regression test that exercises the loose-file path with a nil
filter.

Assisted-by: claude:claude-opus-4-7 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-15 10:07:50 +02:00
LocalAI [bot]
57fa178a64 feat(swagger): update swagger (#9824)
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 09:30:29 +02:00
massy_o
745473cbe6 Validate video image URLs before download (#9819)
Signed-off-by: massy-o <telitos000@gmail.com>
2026-05-14 15:07:17 +02:00
massy_o
594c9fd92e Close Hugging Face scan response body (#9818)
Signed-off-by: massy-o <telitos000@gmail.com>
2026-05-14 12:35:29 +02:00
LocalAI [bot]
8af963bdd9 fix(streaming): comply with OpenAI usage / stream_options spec (#9815)
* fix(streaming): comply with OpenAI usage / stream_options spec (#8546)

LocalAI emitted `"usage":{"prompt_tokens":0,...}` on every streamed
chunk because `OpenAIResponse.Usage` was a value type without
`omitempty`. The official OpenAI Node SDK and its consumers
(continuedev/continue, Kilo Code, Roo Code, Zed, IntelliJ Continue)
filter on a truthy `result.usage` to detect the trailing usage chunk;
LocalAI's zero-but-non-null usage on every intermediate chunk made
that filter swallow every content chunk and surface an empty chat
response while the server log looked successful.

Changes:

- `core/schema/openai.go`: `Usage *OpenAIUsage \`json:"usage,omitempty"\``
  so intermediate chunks no longer carry a `usage` key. Add
  `OpenAIRequest.StreamOptions` with `include_usage` to mirror OpenAI's
  request field.
- `core/http/endpoints/openai/chat.go` and `completion.go`: keep using
  the `Usage` struct field as an in-process channel for the running
  cumulative, but strip it before JSON marshalling. When the request
  set `stream_options.include_usage: true`, emit a dedicated trailing
  chunk with `"choices": []` and the populated usage (matching the
  OpenAI spec and llama.cpp's server behavior).
- `chat_emit.go`: new `streamUsageTrailerJSON` helper; drop the
  `usage` parameter from `buildNoActionFinalChunks` since chunks no
  longer carry usage.
- Update `image.go`, `inpainting.go`, `edit.go` to wrap their Usage
  values with `&` for the new pointer field.
- UI: send `stream_options:{include_usage:true}` from the React
  (`useChat.js`) and legacy (`static/chat.js`) chat clients so the
  token-count badge keeps populating now that the server is
  spec-compliant.

Tests:

- New `chat_stream_usage_test.go` pins the spec invariants:
  intermediate chunks have no `usage` key, the trailer JSON has
  `"choices":[]` and a populated `usage`, and `OpenAIRequest` parses
  `stream_options.include_usage`.
- Update `chat_emit_test.go` to reflect that finals no longer embed
  usage.

Verified against the live LocalAI instance: before the fix Continue's
filter logic swallowed 16/16 token chunks; with the new shape it
yields 4/5 and routes usage through the dedicated trailer chunk.

Fixes #8546

Assisted-by: Claude:opus-4.7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(streaming): silence errcheck on usage trailer Fprintf

The new spec-compliant `stream_options.include_usage` trailer writes
were flagged by errcheck since they're new code (golangci-lint runs
new-from-merge-base on master); the surrounding `fmt.Fprintf` data:
writes are grandfathered. Drop the return values explicitly to match
the linter's contract without adding a nolint shim.

Assisted-by: Claude:opus-4.7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-14 08:53:46 +02:00
LocalAI [bot]
6e1dbae256 feat(llama-cpp): expose 12 missing common_params via options[] (#9814)
The llama.cpp backend already accepts a free-form options: array in the
model config that maps to common_params fields, but a coverage audit
against upstream pin 7f3f843c flagged 12 user-visible knobs that were
neither set via the typed proto fields nor reachable via options:.

Wire them up under the existing if/else chain in params_parse, before
the speculative section. Each new option follows the file's prevailing
patterns (try/catch around numeric parses, the same true/1/yes/on bool
form used elsewhere, hardware_concurrency() fallback for thread counts,
mirror of draft_override_tensor for override_tensor).

Top-level / batching / IO:
  - n_ubatch (alias ubatch) -- physical batch size; was previously
    force-aliased to n_batch at line 482, blocking embedding/rerank
    workloads that need independent control
  - threads_batch (alias n_threads_batch) -- main-model batch threads;
    mirrors the existing draft_threads_batch
  - direct_io (alias use_direct_io) -- O_DIRECT model loads
  - verbosity -- llama.cpp log threshold (line 479 had this commented
    out)
  - override_tensor (alias tensor_buft_overrides) -- per-tensor buffer
    overrides for the main model; mirrors draft_override_tensor

Embedding / multimodal:
  - pooling_type (alias pooling) -- mean/cls/last/rank/none; previously
    only auto-flipped to RANK for rerankers
  - embd_normalize (alias embedding_normalize) -- and the embedding
    handler now reads params_base.embd_normalize instead of a hardcoded
    2 at the previous embd_normalize literal in Embedding()
  - mmproj_use_gpu (alias mmproj_offload) -- mmproj on CPU vs GPU
  - image_min_tokens / image_max_tokens -- per-image vision token budget

Reasoning surface (the audit-focus three; LocalAI's existing
ReasoningConfig.DisableReasoning only feeds the per-request
chat_template_kwargs.enable_thinking and does not touch any of these):
  - reasoning_format -- none/auto/deepseek/deepseek-legacy parser
  - enable_reasoning (alias reasoning_budget) -- -1/0/>0 thinking budget
  - prefill_assistant -- trailing-assistant-message prefill toggle

All 14 referenced fields exist on both the upstream pin and the
turboquant fork's common.h, so no LOCALAI_LEGACY_LLAMA_CPP_SPEC guard
is needed.

Docs: extend model-configuration.md with new "Reasoning Models",
"Multimodal Backend Options", "Embedding & Reranking Backend Options",
and "Other Backend Tuning Options" subsections; also refresh the
Speculative Type Values table to show the new dash-separated canonical
names alongside the underscore aliases LocalAI still accepts.


Assisted-by: claude-code:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-14 08:53:34 +02:00
LocalAI [bot]
53bdb18d10 chore: ⬆️ Update ggml-org/llama.cpp to 7f3f843c31cd32dc4adc10b393342dfee071c332 (#9809)
* ⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix(llama-cpp): adapt to upstream COMMON_SPECULATIVE_TYPE_DRAFT rename

ggml-org/llama.cpp#22964 ("spec: update CLI arguments for better
consistency") renamed the speculative type enum values:
  COMMON_SPECULATIVE_TYPE_DRAFT  -> COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE
  COMMON_SPECULATIVE_TYPE_EAGLE3 -> COMMON_SPECULATIVE_TYPE_DRAFT_EAGLE3
and the registered name strings flipped from underscore- to dash-
separated form (e.g. ngram_simple -> ngram-simple), with the bare
draft/eagle3 aliases replaced by draft-simple/draft-eagle3.

This broke the build with the new LLAMA_VERSION on every variant
(vulkan/arm64, darwin and likely all the rest) at grpc-server.cpp:461.

Update the upstream branch of the speculative-type fallback to use the
new identifier (the LOCALAI_LEGACY_LLAMA_CPP_SPEC fork branch keeps the
old name), and normalize spec_type option tokens before passing them to
common_speculative_types_from_names so existing model configs that say
spec_type:draft / spec_type:ngram_simple keep working.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7

---------

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-14 08:53:23 +02:00
LocalAI [bot]
42a8db3573 ci(image): publish missing :latest-* and :v<X>-* singleton image tags (#9812)
* ci(image): wire singleton merges + `--` artifact separator

Closes the same singletons gap on the LocalAI server image workflow that
PR #9781 closed for backends. The user observed it as missing
:latest-gpu-nvidia-cuda-12 etc. on quay.io/go-skynet/local-ai — the
build matrix has six single-arch entries with no corresponding merge
step, so their per-arch digests push (push-by-digest=true) and never
get tagged:

  - -gpu-hipblas              (hipblas-jobs)
  - -gpu-nvidia-cuda-12       (core-image-build)
  - -gpu-nvidia-cuda-13       (core-image-build)
  - -gpu-intel                (core-image-build)
  - -nvidia-l4t-arm64         (gh-runner)
  - -nvidia-l4t-arm64-cuda-13 (gh-runner)

Only :latest, :v<X>, :latest-gpu-vulkan and :v<X>-gpu-vulkan were
actually being published before this commit (the two multiarch suffixes
that had merge jobs).

Changes:

1. image.yml: add six new merge jobs, one per single-arch entry. Each
   `needs:` only its parent build job (matching the existing pattern
   for core-image-merge / gpu-vulkan-image-merge).
2. image_build.yml: switch artifact name to
   `digests-localai<suffix>--<platform-tag-or-"single">`. The `--`
   separator anchors the merge-side glob so a singleton tag-suffix
   doesn't over-match a longer suffix that shares its prefix
   (-nvidia-l4t-arm64 vs -nvidia-l4t-arm64-cuda-13). Same convention
   as backend_build.yml's fix.
3. image_merge.yml: update the download pattern to match.

Next master push or tag release should produce :latest-gpu-hipblas,
:latest-gpu-nvidia-cuda-12, :latest-gpu-nvidia-cuda-13, :latest-gpu-intel,
:latest-nvidia-l4t-arm64, :latest-nvidia-l4t-arm64-cuda-13 (and their
:v<X>-* equivalents) for the first time on the post-#9781 workflow.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(image): add !cancelled() guard to all 8 image merge jobs

Parity pass with backend.yml's merge jobs (8521af14). Without
!cancelled(), GHA's default `needs:` cascade skips the merge when ANY
matrix cell of the parent build job fails or is cancelled — so a single
flaky leg would suppress publication of every other tag-suffix's
manifest list. Same fix the backend got after v4.2.1 showed 2 failed
singlearch builds cascade-skip 199 singlearch merge entries.

Applied to all 8 image merges:

  - core-image-merge
  - gpu-vulkan-image-merge
  - gpu-nvidia-cuda-12-image-merge       (added in e5300f1a)
  - gpu-nvidia-cuda-13-image-merge       (added in e5300f1a)
  - gpu-intel-image-merge                (added in e5300f1a)
  - gpu-hipblas-image-merge              (added in e5300f1a)
  - nvidia-l4t-arm64-image-merge         (added in e5300f1a)
  - nvidia-l4t-arm64-cuda-13-image-merge (added in e5300f1a)

Build jobs (hipblas-jobs, core-image-build, gh-runner) are
intentionally NOT changed — they have no upstream `needs:` to cascade-
skip from.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-14 00:28:48 +02:00
LocalAI [bot]
0353d3bd77 chore: ⬆️ Update ggml-org/whisper.cpp to 3e9b7d0fef3528ee2208da3cdb873a2c53d2ae2f (#9808)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-14 00:20:14 +02:00
LocalAI [bot]
ec49995190 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 949bb8f1d660fc1264c137a6f3dbd619375f6134 (#9807)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-14 00:15:32 +02:00
Tai An
67c34bbb96 fix(middleware): parse OpenAI-spec tool_choice in /v1/chat/completions (#9559)
* fix(middleware): parse OpenAI-spec tool_choice in /v1/chat/completions

Follows up on #9526 (the 3-site setter fix) by addressing the remaining
clause in #9508 — string mode and OpenAI-spec specific-function shape both
silently failed in the /v1/chat/completions parsing path.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(middleware): restore LF endings and cover tool_choice parsing with specs

The previous commit on this branch saved core/http/middleware/request.go
with CRLF line endings, ballooning the diff against master to 684 / 651
for what is in reality a ~50-line parsing change. Restore LF (matches
.editorconfig end_of_line = lf).

Add 11 Ginkgo specs under "SetModelAndConfig tool_choice parsing
(chat completions)" that parallel the existing MergeOpenResponsesConfig
specs from #9509. They drive the full middleware chain (SetModelAndConfig
+ SetOpenAIRequest) and assert:

  * "required"  -> ShouldUseFunctions=true, no specific name
  * "none"      -> ShouldUseFunctions=false (tools disabled per OpenAI spec)
  * "auto"      -> default, tools available, no specific name
  * {type:function, function:{name:X}}  (spec)    -> X is forced
  * {type:function, name:X}             (legacy)  -> X is forced
  * nested wins when both forms are present
  * malformed shapes (no type, wrong type, no name, empty name) are no-ops

Update the inline comment on the string case to describe the actual
mechanism: "none" reaches SetFunctionCallString("none") downstream and
is then honored by ShouldUseFunctions() returning false. Before this PR
json.Unmarshal([]byte("none"), &functions.Tool{}) failed silently, so
"none" was ignored - making "none" actually work is a real behavior fix
this PR brings.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4-7 [Claude Code]

* fix(middleware): preserve pre-#9559 support for JSON-string-encoded tool_choice

Some non-spec clients send tool_choice as a JSON-encoded string of an
object form, e.g. "{\"type\":\"function\",\"function\":{\"name\":\"X\"}}".
The pre-#9559 code accepted this by accident: its case string: branch
ran json.Unmarshal([]byte(content), &functions.Tool{}), which succeeded
for that double-encoded shape even though it failed for the legitimate
plain string modes "auto" / "none" / "required".

The first version of this PR routed every string straight to
SetFunctionCallString as a mode, which fixed the plain-string cases but
silently regressed the double-encoded one (funcs.Select("{...}") returns
nothing). Restore the fallback: when a string looks like a JSON object,
try parsing it as a tool_choice map first; fall through to mode-string
handling only when no usable name comes out.

Factor the map-name extraction into a small helper
(extractToolChoiceFunctionName) so the string-fallback and the regular
map case go through identical code, and accept both the OpenAI-spec
nested shape and the legacy/Anthropic flat shape from either entry
point.

Add 3 Ginkgo specs covering the double-encoded case (nested form, legacy
form, and the fall-through when the JSON has no usable name).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4-7 [Claude Code]

* test(middleware): silence errcheck on AfterEach os.RemoveAll

The new tool_choice parsing tests added a second AfterEach that calls
os.RemoveAll(modelDir) without checking the error; errcheck flagged it.
Suppress with the standard _ = idiom. The pre-existing AfterEach on the
earlier Describe still elides the check the same way it did before -
leaving that untouched to keep this commit minimal.

Assisted-by: Claude:opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-14 00:14:38 +02:00
LocalAI [bot]
4430fae779 chore: ⬆️ Update antirez/ds4 to 0cba357ca1bc0e7510421cc26888e420ea942123 (#9806)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-14 00:14:23 +02:00
LocalAI [bot]
ab01ed1a3e fix(agentpool): close truncate-then-read race in agent_jobs.json persistence (#9811)
* fix(agentpool): close truncate-then-read race in agent_jobs.json persistence

Three call sites wrote and read agent_jobs.json (and agent_tasks.json)
through three independent mutexes:

  - AgentJobService.ExecuteJob spawns go saveJobs(job) -> fileJobPersister
    holding p.mu
  - AgentJobService.SaveJobsToFile holding service.fileMutex
  - AgentJobService.LoadJobsFromFile on a separate service instance holding
    a different service.fileMutex

Nothing serialized those mutexes, and both writers used os.WriteFile, which
opens O_TRUNC. A reader landing between the truncate and the write saw a
zero-byte file and surfaced as `unexpected end of JSON input` at offset 0.
The macOS tests-apple job started hitting this consistently once the path
filter was removed from .github/workflows/test.yml and the file-mode race
test ran on every push (run 25823124797 was the first observed failure).

Two changes close the window:

1. fileJobPersister.saveTasksToFile / saveJobsToFile now write to a
   same-directory temp file and os.Rename to the final path. rename(2) is
   atomic on POSIX, so concurrent readers see either the prior contents or
   the new contents and never a zero-byte window. The helper Syncs before
   close so a crash mid-write leaves either the old file intact or the temp
   behind (cleaned up on next save).

2. AgentJobService.{Load,Save}{Tasks,Jobs}{FromFile,ToFile} are collapsed
   to thin wrappers around fileJobPersister, removing the duplicate write
   path and the redundant service.fileMutex / service.tasksFile /
   service.jobsFile fields. Within a single service all task/job I/O now
   serializes on the persister's mutex; the atomic rename handles the
   cross-instance case the tests exercise.

Adds a regression test that hammers SaveJobsToFile and LoadJobsFromFile
concurrently for 500ms across two service instances on the same paths.
On master this reproduces `unexpected end of JSON input` on Linux within
~500ms; with the fix the suite ran -until-it-fails for 30s (54 attempts,
all green).

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(agentpool): route service flush/load through JobPersister interface

The first cut of the race fix made AgentJobService.{Save,Load}{Tasks,Jobs}*
type-assert s.persister to *fileJobPersister so they could reach the
unexported saveTasksToFile / saveJobsToFile helpers. That defeats the
JobPersister interface: the service is back to reasoning about a concrete
implementation instead of an abstraction.

Promote the bulk-flush operations to the interface as FlushTasks / FlushJobs:

  - fileJobPersister.FlushTasks/FlushJobs call the existing private helpers
    (atomic temp+rename writes from the prior commit).
  - dbJobPersister.FlushTasks/FlushJobs are no-ops because SaveTask/SaveJob
    are already write-through to the database.

The service's four file-named methods now talk only to the interface:
LoadTasks/LoadJobs read through s.persister.LoadTasks/LoadJobs, and the
Save side calls FlushTasks/FlushJobs. The "FromFile"/"ToFile" suffixes
stay for backward compat with user_services.go and the existing tests,
but they no longer claim a file-only contract.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-13 23:58:43 +02:00
Ettore Di Giacinto
6bfe7f8c05 ci(image-merge): apply the keepalive+ci-cache source fix to image_merge
Mirror of 8521af14 (which fixed backend_merge.yml) for image_merge.yml.
Today's master-push run 25823024353 failed the gpu-vulkan-image-merge job
with the exact same error pattern the backend merge had on v4.2.2:

  ERROR: quay.io/go-skynet/local-ai@sha256:68b22611...: not found

Same root cause: image_build.yml pushes the per-arch manifest to
quay.io/go-skynet/local-ai with push-by-digest=true (no tag), then the
merge runs minutes-to-hours later, by which time quay's per-repo manifest
GC has reaped the untagged digest from local-ai. The blob still lives in
quay's storage but local-ai@<digest> no longer resolves.

Three matching edits:

1. image_build.yml: anchor each per-arch digest into ci-cache immediately
   after the push, reusing .github/scripts/anchor-digest-in-cache.sh with
   SOURCE_IMAGE=quay.io/go-skynet/local-ai and TAG_SUFFIX defaulting to
   "-core" for the core image (matches the artifact-name convention).
2. image_merge.yml: change the quay merge source from local-ai@<digest>
   to ci-cache@<digest>. Same correctness argument as backend_merge.yml —
   the manifest content is alive in ci-cache; buildx imagetools create
   republishes it into local-ai and writes the user-facing manifest list
   pointing at it. End state in local-ai is self-contained.
3. image_merge.yml: add a sparse `actions/checkout@v6` (only
   .github/scripts) so cleanup-keepalive-tags.sh is available, plus the
   cleanup step itself with TAG_SUFFIX matching the anchor's "-core"
   placeholder.

v4.2.3's image.yml run completed successfully (~50 min between push and
merge — beat quay's GC). This commit closes the race for future releases
and master pushes regardless of run length.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-13 21:37:03 +00:00
LocalAI [bot]
5a42dbf3ec docs: ⬆️ update docs version mudler/LocalAI (#9805)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 22:42:54 +02:00
Adira
c2fe0a6475 fix(http): honor X-Forwarded-Prefix when proxy strips the prefix (#9614)
* fix(http): honor X-Forwarded-Prefix when proxy strips the prefix

Closes #9145.

Two related issues kept the React UI from loading when a reverse proxy
rewrites a sub-path with prefix-stripping (e.g. Caddy `handle_path`):

1. `BaseURL` only computed a prefix from the path StripPathPrefix had
   removed, so when the proxy strips the prefix before forwarding, the
   request arrives without it and the base URL was returned without a
   prefix. Extract a `BasePathPrefix` helper and add an
   `X-Forwarded-Prefix` header fallback so the prefix is recovered.
2. `<base href>` only changes how relative URLs resolve; the build
   emits path-absolute references like `/assets/...` and
   `/favicon.svg`, which still resolve against the origin and bypass
   the proxy prefix. Rewrite those references in the served
   `index.html` so the browser requests them through the proxy.

Adds unit coverage for `BaseURL` with a pre-stripped path and an
end-to-end test for the proxy-stripped scenario.

Assisted-by: Claude:claude-opus-4-7

* fix(http): gate X-Forwarded-Prefix through SafeForwardedPrefix in BasePathPrefix

BasePathPrefix consumed X-Forwarded-Prefix directly, so a value the
codebase elsewhere rejects (e.g. "//evil.com") slipped through and was
interpolated into the SPA index.html — both into the path-absolute asset
URL rewrite in serveIndex (turning "/assets/..." into "//evil.com/assets/...",
a protocol-relative URL that loads JS from a foreign origin) and into
<base href>. Route the header through the existing SafeForwardedPrefix
validator that StripPathPrefix and prefixRedirect already use, and
HTML-escape the prefix before injecting it into the asset rewrite as
defense in depth against attribute breakout.

Tests cover //evil.com, backslashes, control chars, CR/LF and a missing
leading slash; the integration test asserts an unsafe prefix can't poison
asset URLs.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7-1m [Read] [Edit] [Bash]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-13 21:59:33 +02:00
LocalAI [bot]
ddbbdf45b9 chore: ⬆️ Update TheTom/llama-cpp-turboquant to 5aeb2fdbe26cd4c534c6fa15de73cb5749bd0403 (#9740)
⬆️ Update TheTom/llama-cpp-turboquant

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 21:58:33 +02:00
LocalAI [bot]
b4fdb41dcc fix(distributed): cascade-clean stale node_models rows + filter routing by healthy status (#9754)
* fix(distributed): cascade-clean stale node_models on drain and filter routing by healthy status

Stale node_models rows (state="loaded") were surviving past the healthy
state of their owning node, causing /embeddings (and other inference
paths) to dispatch to a backend whose process was gone or drained. The
downstream symptom in a live cluster was pgvector rejecting inserts
with "vector cannot have more than 16000 dimensions (SQLSTATE 54000)"
because the misbehaving backend silently returned a malformed
(oversized) tensor; the Models page showed the model as "running"
without an associated node, like a stale entry, even though the node
was no longer visible in the Nodes view.

Two changes here, plus a third in a follow-up commit:

- MarkDraining now cascade-deletes node_models rows for the affected
  node, mirroring MarkOffline. Drains are explicit operator actions —
  the box has been intentionally taken out of rotation — so clearing
  the rows stops the Models UI from misreporting and prevents the
  routing layer from picking those rows if scheduling logic is ever
  relaxed. In-flight requests already hold their gRPC client through
  Route() and finish normally; the only observable effect is a
  non-fatal IncrementInFlight warning, acceptable for a drain.

  MarkUnhealthy is deliberately left status-only: it fires from
  managers_distributed / reconciler on a single nats.ErrNoResponders
  with no retry, so a transient NATS hiccup must not nuke every loaded
  model and force a full reload on recovery.

- FindAndLockNodeWithModel's inner JOIN now filters on
  backend_nodes.status = healthy in addition to node_models.state =
  loaded. The previous version relied on the second node-fetch step to
  reject non-healthy nodes, but a concurrent reader could still pick
  the same stale row in the same window. Belt-and-braces.

- DistributedConfig.PerModelHealthCheck renamed to
  DisablePerModelHealthCheck and inverted at the call site so
  per-model gRPC probing is on by default. The probe (now made
  consecutive-miss aware in a follow-up commit) independently health-
  checks each model's gRPC address and removes stale node_models rows
  when the backend has crashed even though the worker's node-level
  heartbeat is still arriving.

  Migration: the field had no CLI flag, env var binding, or YAML key
  in tree (only the bare struct field), so there is no user-facing
  migration. Anything constructing DistributedConfig in code needs to
  drop the assignment (default now does the right thing) or invert it.

Assisted-by: Claude:claude-opus-4-7 go-vet go-test golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(distributed): require consecutive misses before per-model probe removes a row

The per-model gRPC probe used to remove a node_models row on a single
failed health check. With the per-model probe now on by default, that
made any 5-second gRPC blip (network jitter, a long-running request
hogging the worker's gRPC server thread, brief GC pause) trigger a
full reload of the affected model — too eager for production.

Require perModelMissThreshold (3) consecutive failed probes before
removal. At the default 15s tick a model must be unreachable for ~45s
before reap; a single successful probe in between resets the streak.
Per-(node, model, replica) state tracked under a mutex on the monitor.

If the removal call itself fails, the miss counter is left in place
so the next tick retries rather than starting the streak over.

Tests:
- removes stale model via per-model health check after consecutive
  failures (replaces the single-shot expectation)
- preserves model row when an intermittent failure is followed by a
  success (covers the reset-on-success path and verifies the counter
  reset by failing twice more without crossing threshold)
- newTestHealthMonitor initializes the misses map so direct-construct
  test helpers don't nil-map-panic in the probe path

Assisted-by: Claude:claude-opus-4-7 go-vet go-test golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-13 21:57:50 +02:00
Richard Palethorpe
0245b33eab feat(realtime): Add Liquid Audio s2s model and assistant mode on talk page (#9801)
* feat(liquid-audio): add LFM2.5-Audio any-to-any backend + realtime_audio usecase

Wires LiquidAI's LFM2.5-Audio-1.5B as a self-contained Realtime API model:
single engine handles VAD, transcription, LLM, and TTS in one bidirectional
stream — drop-in alternative to a VAD+STT+LLM+TTS pipeline.

Backend
- backend/python/liquid-audio/ — new Python gRPC backend wrapping the
  `liquid-audio` package. Modes: chat / asr / tts / s2s, voice presets,
  Load/Predict/PredictStream/AudioTranscription/TTS/VAD/AudioToAudioStream/
  Free and StartFineTune/FineTuneProgress/StopFineTune. Runtime monkey-patch
  on `liquid_audio.utils.snapshot_download` so absolute local paths from
  LocalAI's gallery resolve without a HF round-trip. soundfile in place of
  torchaudio.load/save (torchcodec drags NVIDIA NPP we don't bundle).
- backend/backend.proto + pkg/grpc/{backend,client,server,base,embed,
  interface}.go — new AudioToAudioStream RPC mirroring AudioTransformStream
  (config/frame/control oneof in; typed event+pcm+meta out).
- core/services/nodes/{health_mock,inflight}_test.go — add stubs for the
  new RPC to the test fakes.

Config + capabilities
- core/config/backend_capabilities.go — UsecaseRealtimeAudio, MethodAudio
  ToAudioStream, UsecaseInfoMap entry, liquid-audio BackendCapability row.
- core/config/model_config.go — FLAG_REALTIME_AUDIO bitmask, ModalityGroups
  membership in both speech-input and audio-output groups so a lone flag
  still reads as multimodal, GetAllModelConfigUsecases entry, GuessUsecases
  branch.

Realtime endpoint
- core/http/endpoints/openai/realtime.go — extract prepareRealtimeConfig()
  so the gate is unit-testable; accept realtime_audio models and self-fill
  empty pipeline slots with the model's own name (user-pinned slots win).
- core/http/endpoints/openai/realtime_gate_test.go — six specs covering nil
  cfg, empty pipeline, legacy pipeline, self-contained realtime_audio,
  user-pinned VAD slot, and partial legacy pipeline.

UI + endpoints
- core/http/routes/ui.go — /api/pipeline-models accepts either a legacy
  VAD+STT+LLM+TTS pipeline or a realtime_audio model; surfaces a
  self_contained flag so the Talk page can collapse the four cards.
- core/http/routes/ui_api.go — realtime_audio in usecaseFilters.
- core/http/routes/ui_pipeline_models_test.go — covers both code paths.
- core/http/react-ui/src/pages/Talk.jsx — self-contained badge instead of
  the four-slot grid; rename Edit Pipeline → Edit Model Config; less
  pipeline-specific wording.
- core/http/react-ui/src/pages/Models.jsx + locales/en/models.json — new
  realtime_audio filter button + i18n.
- core/http/react-ui/src/utils/capabilities.js — CAP_REALTIME_AUDIO.
- core/http/react-ui/src/pages/FineTune.jsx — voice + validation-dataset
  fields, surfaced when backend === liquid-audio, plumbed via
  extra_options on submit/export/import.

Gallery + importer
- gallery/liquid-audio.yaml — config template with known_usecases:
  [realtime_audio, chat, tts, transcript, vad].
- gallery/index.yaml — four model entries (realtime/chat/asr/tts) keyed by
  mode option. Fixed pre-existing `transcribe` typo on the asr entry
  (loader silently dropped the unknown string → entry never surfaced as a
  transcript model).
- gallery/lfm.yaml — function block for the LFM2 Pythonic tool-call format
  `<|tool_call_start|>[name(k="v")]<|tool_call_end|>` matching
  common_chat_params_init_lfm2 in vendored llama.cpp.
- core/gallery/importers/{liquid-audio,liquid-audio_test}.go — detector
  matches LFM2-Audio HF repos (excludes -gguf mirrors); mode/voice
  preferences plumbed through to options.
- core/gallery/importers/importers.go — register LiquidAudioImporter
  before LlamaCPPImporter.
- pkg/functions/parse_lfm2_test.go — seven specs for the response/argument
  regex pair on the LFM2 pythonic format.

Build matrix
- .github/backend-matrix.yml — seven liquid-audio targets (cuda12, cuda13,
  l4t-cuda-13, hipblas, intel, cpu amd64, cpu arm64). Jetpack r36 cuda-12
  is skipped (Ubuntu 22.04 / Python 3.10 incompatible with liquid-audio's
  3.12 floor).
- backend/index.yaml — anchor + 13 image entries.
- Makefile — .NOTPARALLEL, prepare-test-extra, test-extra,
  docker-build-liquid-audio.

Docs
- .agents/plans/liquid-audio-integration.md — phased plan; PR-D (real
  any-to-any wiring via AudioToAudioStream), PR-E (mid-audio tool-call
  detector), PR-G (GGUF entries once upstream llama.cpp PR #18641 lands)
  remain.
- .agents/api-endpoints-and-auth.md — expand the capability-surface
  checklist with every place a new FLAG_* needs to be registered.

Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(realtime): function calling + history cap for any-to-any models

Three pieces, all on the realtime_audio path that just landed:

1. liquid-audio backend (backend/python/liquid-audio/backend.py):
   - _build_chat_state grows a `tools_prelude` arg.
   - new _render_tools_prelude parses request.Tools (the OpenAI Chat
     Completions function array realtime.go already serialises) and
     emits an LFM2 `<|tool_list_start|>…<|tool_list_end|>` system turn
     ahead of the user history. Mirrors gallery/lfm.yaml's `function:`
     template so the model sees the same prompt shape whether served
     via llama-cpp or here. Without this the backend silently dropped
     tools — function calling was wired end-to-end on the Go side but
     the model never saw a tool list.

2. Realtime history cap (core/http/endpoints/openai/realtime.go):
   - Session grows MaxHistoryItems int; default picked by new
     defaultMaxHistoryItems(cfg) — 6 for realtime_audio models (LFM2.5
     1.5B degrades quickly past a handful of turns), 0/unlimited for
     legacy pipelines composing larger LLMs.
   - triggerResponse runs conv.Items through trimRealtimeItems before
     building conversationHistory. Helper walks the cut left if it
     would orphan a function_call_output, so tool result + call pairs
     stay intact.
   - realtime_gate_test.go: specs for defaultMaxHistoryItems and
     trimRealtimeItems (zero cap, under cap, over cap, tool-call pair
     preservation).

3. Talk page (core/http/react-ui/src/pages/Talk.jsx):
   - Reuses the chat page's MCP plumbing — useMCPClient hook,
     ClientMCPDropdown component, same auto-connect/disconnect effect
     pattern. No bespoke tool registry, no new REST endpoints; tools
     come from whichever MCP servers the user toggles on, exactly as
     on the chat page.
   - sendSessionUpdate now passes session.tools=getToolsForLLM(); the
     update re-fires when the active server set changes mid-session.
   - New response.function_call_arguments.done handler executes via
     the hook's executeTool (which round-trips through the MCP client
     SDK), then replies with conversation.item.create
     {type:function_call_output} + response.create so the model
     completes its turn with the tool output. Mirrors chat's
     client-side agentic loop, translated to the realtime wire shape.

UI changes require a LocalAI image rebuild (Dockerfile:308-313 bakes
react-ui/dist into the runtime image). Backend.py changes can be
swapped live in /backends/<id>/backend.py + /backend/shutdown.

Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(realtime): LocalAI Assistant ("Manage Mode") for the Talk page

Mirrors the chat-page metadata.localai_assistant flow so users can ask the
realtime model what's loaded / installed / configured. Tools are run
server-side via the same in-process MCP holder that powers the chat
modality — no transport switch, no proxy, no new wire protocol.

Wire:
- core/http/endpoints/openai/realtime.go:
  - RealtimeSessionOptions{LocalAIAssistant,IsAdmin}; isCurrentUserAdmin
    helper mirrors chat.go's requireAssistantAccess (no-op when auth
    disabled, else requires auth.RoleAdmin).
  - Session grows AssistantExecutor mcpTools.ToolExecutor.
  - runRealtimeSession, when opts.LocalAIAssistant is set: gate on admin,
    fail closed if DisableLocalAIAssistant or the holder has no tools,
    DiscoverTools and inject into session.Tools, prepend
    holder.SystemPrompt() to instructions.
  - Tool-call dispatch loop: when AssistantExecutor.IsTool(name), run
    ExecuteTool inproc, append a FunctionCallOutput to conv.Items, skip
    the function_call_arguments client emit (the client can't execute
    these — it doesn't know about them). After the loop, if any
    assistant tool ran, trigger another response so the model speaks the
    result. Mirrors chat's agentic loop, driven server-side rather than
    via client round-trip.

- core/http/endpoints/openai/realtime_webrtc.go: RealtimeCallRequest
  gains `localai_assistant` (JSON omitempty). Handshake calls
  isCurrentUserAdmin and builds RealtimeSessionOptions.

- core/http/react-ui/src/pages/Talk.jsx: admin-only "Manage Mode"
  checkbox under the Tools dropdown; passes localai_assistant: true to
  realtimeApi.call's body, captured in the connect callback's deps.

Mirroring chat's pattern means the in-process MCP tools surface "just
works" for the Talk page without exposing a Streamable-HTTP MCP endpoint
(which was the alternative). Clients with their own MCP servers can
still use the existing ClientMCPDropdown path in parallel; the realtime
handler distinguishes them by AssistantExecutor.IsTool() at dispatch
time.

Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(realtime): render Manage Mode tool calls in the Talk transcript

Previously the realtime endpoint only emitted response.output_item.added
for the FunctionCall item, and Talk.jsx's switch ignored the event — so
server-side tool runs were invisible in the UI. The model would speak
the result but the user had no way to see what tool was actually
called.

realtime.go: after executing an assistant tool inproc, emit a second
output_item.added/.done pair for the FunctionCallOutput item. Mirrors
the way the chat page displays tool_call + tool_result blocks.

Talk.jsx: handle both response.output_item.added and .done. Render
FunctionCall (with arguments) and FunctionCallOutput (pretty-printed
JSON when possible) as two transcript entries — `tool_call` with the
wrench icon, `tool_result` with the clipboard icon, both in mono-space
secondary-colour. Resets streamingRef after the result so the next
assistant text delta starts a fresh transcript entry instead of
appending to the previous turn.

Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* refactor(realtime): bound the Manage Mode tool-loop + preserve assistant tools

Fallout from a review pass on the Manage Mode patches:

- Bound the server-side agentic loop. triggerResponse used to recurse on
  executedAssistantTool with no cap — a model that kept calling tools
  would blow the goroutine stack. New maxAssistantToolTurns = 10 (mirrors
  useChat.js's maxToolTurns). Public triggerResponse is now a thin shim
  over triggerResponseAtTurn(toolTurn int); recursion increments the
  counter and stops at the cap with an xlog.Warn.

- Preserve Manage Mode tools across client session.update. The handler
  used to blindly overwrite session.Tools, so toggling a client MCP
  server mid-session silently wiped the in-process admin tools. Session
  now caches the original AssistantTools slice at session creation and
  the session.update handler merges them back in (client names win on
  collision — the client is explicit).

- strconv.ParseBool for the localai_assistant query param instead of
  hand-rolled "1" || "true". Mirrors LocalAIAssistantFromMetadata.

- Talk.jsx: render both tool_call and tool_result on
  response.output_item.done instead of splitting them across .added and
  .done. The server's event pairing (added → done) stays correct; the
  UI just doesn't need to inspect both phases of the same item. One
  switch case instead of two, no behavioural change.

Out of scope (noted for follow-ups): extract a shared assistant-tools
helper between chat.go and realtime.go (duplication is small enough
that two parallel implementations stay readable for now), and an i18n
key for the Manage Mode helper text (Talk.jsx doesn't use i18n
anywhere else yet).

Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* ci(test-extra): wire liquid-audio backend smoke test

The backend ships test.py + a `make test` target and is listed in
backend-matrix.yml, so scripts/changed-backends.js already writes a
`liquid-audio=true|false` output when files under backend/python/liquid-audio/
change. The workflow just wasn't reading it.

- Expose the `liquid-audio` output on the detect-changes job
- Add a tests-liquid-audio job that runs `make` + `make test` in
  backend/python/liquid-audio, gated on the per-backend detect flag

The smoke covers Health() and LoadModel(mode:finetune); fine-tune mode
short-circuits before any HuggingFace download (backend.py:192), so the
job needs neither weights nor a GPU. The full-inference path remains
gated on LIQUID_AUDIO_MODEL_ID, which CI doesn't set.

The four new Go test files (core/gallery/importers/liquid-audio_test.go,
core/http/endpoints/openai/realtime_gate_test.go,
core/http/routes/ui_pipeline_models_test.go, pkg/functions/parse_lfm2_test.go)
are already picked up by the existing test.yml workflow via `make test` →
`ginkgo -r ./pkg/... ./core/...`; their packages all carry RunSpecs entries.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-13 21:57:27 +02:00
Andreas Egli
a2940e5d47 feat: also parse VRAM budget/usage from vulkaninfo (#9800)
Signed-off-by: Andreas Egli <github@kharan.ch>
2026-05-13 21:43:12 +02:00
LocalAI [bot]
a645c1f4aa chore: ⬆️ Update ggml-org/llama.cpp to a9883db8ee021cf16783016a60996d41820b5195 (#9796)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 21:40:31 +02:00
LocalAI [bot]
957619af53 chore: ⬆️ Update ikawrakow/ik_llama.cpp to f9a93c37e2fc021760c3c1aa99cf74c73b7591a7 (#9795)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 00:40:48 +02:00
LocalAI [bot]
ad0ab37230 docs: ⬆️ update docs version mudler/LocalAI (#9792)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 00:40:37 +02:00
LocalAI [bot]
0b81e36504 chore: ⬆️ Update antirez/ds4 to f8b4ed635d559b3a5b44bf2df6a77e21b3e9178f (#9794)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 00:40:09 +02:00
LocalAI [bot]
602866a9d8 chore: ⬆️ Update ggml-org/whisper.cpp to 338cce1e58133261753243802a0e7a430118866d (#9793)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 00:39:57 +02:00
Ettore Di Giacinto
8521af145f ci(merge): source per-arch digests from ci-cache, not local-ai-backends
Follow-up to PR #9781. v4.2.2 (run 25745181433) showed the keepalive
anchor in ci-cache wasn't enough on its own: 19 of 37 multiarch merges
still failed with "manifest not found" for the same digests we'd just
anchored.

Quay's manifest GC is per-repository. The anchor tag in ci-cache
protects the manifest copy that lives in ci-cache, but the same digest
in local-ai-backends is independently tracked and gets reaped because
nothing in local-ai-backends references it (push-by-digest=true leaves
it untagged). The merge then asks
`local-ai-backends@sha256:<digest>` and quay correctly says "not found"
in that repo, even though `ci-cache@sha256:<digest>` is alive and well.

Empirical confirmation against a live failed digest from v4.2.2:

  $ docker buildx imagetools inspect quay.io/go-skynet/ci-cache@sha256:05377fe6...
  Name: quay.io/go-skynet/ci-cache@sha256:05377fe6...
  MediaType: application/vnd.docker.distribution.manifest.v2+json
  $ docker buildx imagetools inspect quay.io/go-skynet/local-ai-backends@sha256:05377fe6...
  ERROR: ... not found

Switch the source of the quay merge step to ci-cache. The blobs the
manifest references are already accessible from local-ai-backends
(verified via direct registry HEAD: HTTP 200 from both repos — the
original push cross-mounted blobs at content-addressable storage time
and they outlive the per-repo manifest GC). buildx imagetools create
republishes the manifest into local-ai-backends, then writes the
user-facing manifest list pointing at it. End state is self-contained:
the published manifest list references child manifests by digest only,
no embedded reference to ci-cache.

Dockerhub merge step is unchanged. Dockerhub's GC isn't aggressive
enough to reap untagged manifests at the timescales we operate on
(verified: localai/localai-backends@<same digest> still resolves cleanly
after >24h).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-12 22:35:55 +00:00
244 changed files with 13581 additions and 1509 deletions

View File

@@ -112,6 +112,8 @@ Add a YAML anchor definition in the `## metas` section (around line 2-300). Look
Add image entries at the end of the file, following the pattern of similar backends such as `diffusers` or `chatterbox`. Include both `latest` (production) and `master` (development) tags. Add image entries at the end of the file, following the pattern of similar backends such as `diffusers` or `chatterbox`. Include both `latest` (production) and `master` (development) tags.
**Note on integrity:** OCI backends installed from a gallery whose `verification:` block is set are verified against a keyless-cosign policy before extraction; tarball/HTTP backends use the optional `sha256:` field. New backends do not need any extra YAML — the gallery-level `verification:` block covers every entry. See [.agents/backend-signing.md](backend-signing.md) for the producer-side CI step.
## 4. Update the Makefile ## 4. Update the Makefile
The Makefile needs to be updated in several places to support building and testing the new backend: The Makefile needs to be updated in several places to support building and testing the new backend:

View File

@@ -284,7 +284,17 @@ Also bump the expected-length count in `api_instructions_test.go` and add the na
### 3. `capabilities.js` symbol (for new model-config FLAG_* flags) ### 3. `capabilities.js` symbol (for new model-config FLAG_* flags)
If your feature needs a new `FLAG_*` usecase flag in `core/config/model_config.go` (so users can filter gallery models by it, and so `/v1/models` surfaces it), also declare the matching symbol in `core/http/react-ui/src/utils/capabilities.js`: If your feature needs a new `FLAG_*` usecase flag in `core/config/model_config.go` (so users can filter gallery models by it, and so `/v1/models` surfaces it), you need to update **all** of:
- `Usecase<Name>` string constant in `core/config/backend_capabilities.go`
- `UsecaseInfoMap` entry mapping the string to its flag + gRPC method
- `FLAG_<NAME>` bitmask in `core/config/model_config.go`
- `GetAllModelConfigUsecases()` map entry (otherwise the YAML loader silently ignores the string)
- `ModalityGroups` membership if the flag should affect `IsMultimodal()` (e.g. realtime_audio is in both speech-input and audio-output groups so a lone flag still reads as multimodal)
- `GuessUsecases()` branch listing the backends that own this capability
- `usecaseFilters` in `core/http/routes/ui_api.go` (drives the gallery filter dropdown)
- `Models.jsx` `FILTERS` array + matching `filters.<camelCase>` i18n key in `core/http/react-ui/public/locales/en/models.json`
- `core/http/react-ui/src/utils/capabilities.js`:
```js ```js
export const CAP_MY_CAPABILITY = 'FLAG_MY_CAPABILITY' export const CAP_MY_CAPABILITY = 'FLAG_MY_CAPABILITY'

120
.agents/backend-signing.md Normal file
View File

@@ -0,0 +1,120 @@
# Backend image signing & verification
LocalAI verifies backend OCI images against a per-gallery keyless-cosign
policy. This page documents the trust model, the producer side
(`.github/workflows/backend_merge.yml` in this repo), and the consumer
side (`pkg/oci/cosignverify` plus the gallery YAML).
## Trust model
- **Producer:** `.github/workflows/backend_merge.yml` signs each pushed
manifest list with `cosign sign --recursive` in keyless mode after
`docker buildx imagetools create`. The signing cert is issued by
Fulcio bound to the workflow's OIDC identity. There is no long-lived
signing key. `--recursive` signs both the manifest list and every
per-arch entry — needed because our consumer resolves a tag to a
per-arch manifest before checking signatures.
- **Storage:** Signatures are written as OCI 1.1 referrers
(`--registry-referrers-mode=oci-1-1`) in the new Sigstore bundle format
(current cosign releases do this by default; no `--new-bundle-format`
flag). No `:sha256-<hex>.sig` tag clutter.
- **Consumer:** `pkg/oci/cosignverify` discovers the bundle via the
referrers API, hands it to `sigstore-go`, and verifies it against the
policy declared in the gallery YAML (`Gallery.Verification`).
- **Revocation:** Keyless cosign certs are ephemeral (10-minute Fulcio
validity), so revocation is policy-side, not CA-side. The gallery's
`verification.not_before` (RFC3339) is the kill-switch — advance it to
invalidate every signature produced before a known compromise window.
## Producer setup
`backend_merge.yml` is the workflow that joins per-arch digests into the
multi-arch manifest list users actually pull, so it's also the right place
to sign. The job needs:
- `permissions: { id-token: write, contents: read }` at the job level so
the runner can exchange its GitHub OIDC token for a Fulcio cert.
- `sigstore/cosign-installer@v3` step (current cosign releases already
default to the new bundle format).
- After each `docker buildx imagetools create`, resolve the resulting
list digest with `docker buildx imagetools inspect <tag> --format
'{{.Manifest.Digest}}'` and sign:
```sh
cosign sign --yes --recursive \
--registry-referrers-mode=oci-1-1 \
"${REGISTRY_REPO}@${DIGEST}"
```
Sign by digest, never by tag — signing by tag binds the signature to
whatever the tag points at *now*, and a subsequent tag push orphans it.
`backend_build_darwin.yml` builds and pushes single-arch darwin images
that bypass the manifest-list merge. If/when those entries get a gallery
`verification:` policy, the equivalent cosign step has to land there
too.
## Consumer setup (in `mudler/LocalAI` gallery YAML)
Once CI is signing, add a `verification:` block to the backend gallery
entry (`backend/index.yaml`):
```yaml
- name: localai
url: github:mudler/LocalAI/backend/index.yaml@master
verification:
issuer: "https://token.actions.githubusercontent.com"
identity_regex: "^https://github\\.com/mudler/LocalAI/\\.github/workflows/backend_merge\\.yml@refs/heads/master$"
# Optional revocation cutoff; advance during incident response.
# not_before: "2026-06-01T00:00:00Z"
```
Identity matching pins the OIDC subject Fulcio issued the signing cert
to. Without this, any image signed by *anyone* with a Fulcio cert would
pass — the regex is what makes a signature mean "produced by our CI".
## Strict mode
Default behaviour: OCI backends without a `verification:` block install
with a warning (logs include `installing OCI backend without signature
verification`). Tarball/HTTP backends without a `sha256` field log a
similar warning.
For production, set `LOCALAI_REQUIRE_BACKEND_INTEGRITY=1` (or pass
`--require-backend-integrity` to `local-ai run` / `local-ai backends
install` / `local-ai models install`). The warning becomes a hard error
and unverifiable backends refuse to install.
## Revocation playbook
If `backend_merge.yml` (or any workflow with `id-token: write`) is
compromised and we've shipped malicious signed images:
1. **Identify the compromise window.** Find the earliest IntegratedTime
from the bad signatures (Rekor search by `subject` filter).
2. **Set `verification.not_before`** in `backend/index.yaml` to a
timestamp just *after* that window's start.
3. **Push the YAML.** Deployed LocalAI instances pick it up on next
gallery refresh (1-hour cache in `core/gallery/gallery.go`).
4. **Fix the underlying compromise** in the workflow and re-sign images
with the new build, which will have IntegratedTime > `not_before`.
5. **Optional:** for absolute decisiveness, also rotate to a new
workflow path (`backend_merge_v2.yml`) and update `identity_regex`.
## Where the code lives
- `pkg/oci/cosignverify/` — verifier, policy, OCI referrer fetch, NotBefore enforcement.
- `pkg/downloader/uri.go``WithImageVerifier` option threaded through `DownloadFileWithContext`.
- `core/gallery/backends.go``backendDownloadOptions` builds the verifier from the gallery's policy.
- `core/config/gallery.go``Gallery.Verification` YAML schema.
- `core/cli/run.go`, `core/cli/backends.go`, `core/cli/models.go``--require-backend-integrity` flag propagation.
- `.github/workflows/backend_merge.yml` — producer-side `cosign sign --recursive` after each multi-arch manifest list push.
## Out of scope (follow-ups)
- **Signing the gallery YAML itself.** The index is fetched over HTTPS
from GitHub; we trust the host. A cosign blob signature on the YAML
would close that gap but adds key-management overhead. Revisit this
page if/when added.
- **Tarball/HTTP backend signing.** Cosign can sign arbitrary blobs, but
for now non-OCI backends keep using the `sha256:` field in YAML.

View File

@@ -61,6 +61,12 @@ Always check `llama.cpp` for new model configuration options that should be supp
- `reasoning_format` - Reasoning format options - `reasoning_format` - Reasoning format options
- Any new flags or parameters - Any new flags or parameters
### Speculative Decoding Types
The `spec_type` option in `grpc-server.cpp` delegates to upstream's `common_speculative_types_from_names()`, so new speculative types added to the `common_speculative_type_from_name` map in `common/speculative.cpp` are picked up automatically with no code changes - only docs need an entry in `docs/content/advanced/model-configuration.md`. Current values: `none`, `draft-simple`, `draft-eagle3`, `draft-mtp`, `ngram-simple`, `ngram-map-k`, `ngram-map-k4v`, `ngram-mod`, `ngram-cache`.
`draft-mtp` (Multi-Token Prediction, [ggml-org/llama.cpp#22673](https://github.com/ggml-org/llama.cpp/pull/22673)) does not need a separate draft GGUF: when `spec_type` includes `draft-mtp` and `draftmodel` is empty, the upstream server creates an MTP context off the target model itself. LocalAI's gRPC layer needs no changes for this — it works through the existing `params.speculative.types` plumbing and the derived `cparams.n_rs_seq = params.speculative.need_n_rs_seq()` in `common_context_params_to_llama`.
### Implementation Guidelines ### Implementation Guidelines
1. **Feature Parity**: Always aim for feature parity with llama.cpp's implementation 1. **Feature Parity**: Always aim for feature parity with llama.cpp's implementation

View File

@@ -278,6 +278,19 @@ include:
dockerfile: "./backend/Dockerfile.python" dockerfile: "./backend/Dockerfile.python"
context: "./" context: "./"
ubuntu-version: '2404' ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-12-liquid-audio'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "liquid-audio"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas' - build-type: 'cublas'
cuda-major-version: "12" cuda-major-version: "12"
cuda-minor-version: "8" cuda-minor-version: "8"
@@ -808,6 +821,19 @@ include:
dockerfile: "./backend/Dockerfile.python" dockerfile: "./backend/Dockerfile.python"
context: "./" context: "./"
ubuntu-version: '2404' ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13-liquid-audio'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "liquid-audio"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas' - build-type: 'cublas'
cuda-major-version: "13" cuda-major-version: "13"
cuda-minor-version: "0" cuda-minor-version: "0"
@@ -1088,6 +1114,19 @@ include:
backend: "vibevoice" backend: "vibevoice"
dockerfile: "./backend/Dockerfile.python" dockerfile: "./backend/Dockerfile.python"
context: "./" context: "./"
- build-type: 'l4t'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/arm64'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-cuda-13-arm64-liquid-audio'
runs-on: 'ubuntu-24.04-arm'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
ubuntu-version: '2404'
backend: "liquid-audio"
dockerfile: "./backend/Dockerfile.python"
context: "./"
- build-type: 'l4t' - build-type: 'l4t'
cuda-major-version: "13" cuda-major-version: "13"
cuda-minor-version: "0" cuda-minor-version: "0"
@@ -1729,6 +1768,19 @@ include:
dockerfile: "./backend/Dockerfile.python" dockerfile: "./backend/Dockerfile.python"
context: "./" context: "./"
ubuntu-version: '2404' ubuntu-version: '2404'
- build-type: 'hipblas'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-liquid-audio'
runs-on: 'ubuntu-latest'
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
skip-drivers: 'false'
backend: "liquid-audio"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'hipblas' - build-type: 'hipblas'
cuda-major-version: "" cuda-major-version: ""
cuda-minor-version: "" cuda-minor-version: ""
@@ -2177,6 +2229,19 @@ include:
dockerfile: "./backend/Dockerfile.python" dockerfile: "./backend/Dockerfile.python"
context: "./" context: "./"
ubuntu-version: '2404' ubuntu-version: '2404'
- build-type: 'intel'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-liquid-audio'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "liquid-audio"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'intel' - build-type: 'intel'
cuda-major-version: "" cuda-major-version: ""
cuda-minor-version: "" cuda-minor-version: ""
@@ -3503,6 +3568,20 @@ include:
dockerfile: "./backend/Dockerfile.python" dockerfile: "./backend/Dockerfile.python"
context: "./" context: "./"
ubuntu-version: '2404' ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
platform-tag: 'amd64'
tag-latest: 'auto'
tag-suffix: '-cpu-liquid-audio'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "liquid-audio"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: '' - build-type: ''
cuda-major-version: "" cuda-major-version: ""
cuda-minor-version: "" cuda-minor-version: ""

View File

@@ -31,6 +31,13 @@ on:
jobs: jobs:
merge: merge:
runs-on: ubuntu-latest runs-on: ubuntu-latest
# id-token: write is required for keyless cosign — the workflow
# exchanges the GitHub OIDC token for a short-lived Fulcio cert that
# signs each pushed manifest. Without this permission the runner
# cannot mint the token, and `cosign sign` fails with "no token".
permissions:
contents: read
id-token: write
env: env:
quay_username: ${{ secrets.quayUsername }} quay_username: ${{ secrets.quayUsername }}
steps: steps:
@@ -57,6 +64,16 @@ jobs:
- name: Set up Docker Buildx - name: Set up Docker Buildx
uses: docker/setup-buildx-action@master uses: docker/setup-buildx-action@master
# cosign signs each pushed manifest list with --recursive so the
# index and every per-arch entry get an attached Sigstore bundle.
# Recent cosign releases always emit the new bundle format, so
# there's no extra CLI flag to opt into it.
- name: Install cosign
if: github.event_name != 'pull_request'
uses: sigstore/cosign-installer@v3
with:
cosign-release: 'v2.4.1'
- name: Login to DockerHub - name: Login to DockerHub
if: github.event_name != 'pull_request' if: github.event_name != 'pull_request'
uses: docker/login-action@v4 uses: docker/login-action@v4
@@ -88,6 +105,25 @@ jobs:
latest=${{ inputs.tag-latest }} latest=${{ inputs.tag-latest }}
suffix=${{ inputs.tag-suffix }},onlatest=true suffix=${{ inputs.tag-suffix }},onlatest=true
# Source from ci-cache, not local-ai-backends.
#
# The build job pushes per-arch manifests to local-ai-backends with
# push-by-digest=true (no tag), then anchors a tagged copy into
# ci-cache so the manifest can be retrieved hours later when this
# merge runs. Quay's manifest GC, however, is per-repository: the
# anchor tag in ci-cache protects the manifest there, but the same
# digest in local-ai-backends has no tag in *that* repo and gets
# reaped independently. Sourcing local-ai-backends@<digest> here
# then fails with "manifest not found" — exactly the regression
# we hit on v4.2.2 (19/37 multiarch merges failed).
#
# ci-cache@<digest> resolves because we anchored it there. buildx
# imagetools create copies the manifest into local-ai-backends
# (cross-repo within the same registry, blobs already cross-mounted
# from the original push so no transfer needed) and publishes the
# manifest list with the user-facing tags. The resulting manifest
# list is fully self-contained in local-ai-backends — child digests
# only, no embedded references to ci-cache.
- name: Create manifest list and push (quay) - name: Create manifest list and push (quay)
if: github.event_name != 'pull_request' if: github.event_name != 'pull_request'
working-directory: /tmp/digests working-directory: /tmp/digests
@@ -101,11 +137,25 @@ jobs:
' <<< "$DOCKER_METADATA_OUTPUT_JSON") ' <<< "$DOCKER_METADATA_OUTPUT_JSON")
if [ -z "$tags" ]; then if [ -z "$tags" ]; then
echo "No quay.io tags from docker/metadata-action; skipping quay merge" echo "No quay.io tags from docker/metadata-action; skipping quay merge"
else exit 0
# shellcheck disable=SC2086
docker buildx imagetools create $tags \
$(printf 'quay.io/go-skynet/local-ai-backends@sha256:%s ' *)
fi fi
# shellcheck disable=SC2086
docker buildx imagetools create $tags \
$(printf 'quay.io/go-skynet/ci-cache@sha256:%s ' *)
# Resolve the manifest-list digest (any tag points at it) so
# cosign can sign by digest. Signing by tag would leave the
# signature orphaned the next time the tag moves.
first_tag=$(jq -cr '
.tags | map(select(startswith("quay.io/"))) | .[0]
' <<< "$DOCKER_METADATA_OUTPUT_JSON")
digest=$(docker buildx imagetools inspect "$first_tag" --format '{{.Manifest.Digest}}')
# --recursive walks the list and signs every per-arch entry
# too — clients that resolve a tag to a platform-specific
# manifest before checking signatures need the per-arch
# signatures, not just the list-level one.
cosign sign --yes --recursive \
--registry-referrers-mode=oci-1-1 \
"quay.io/go-skynet/local-ai-backends@${digest}"
- name: Create manifest list and push (dockerhub) - name: Create manifest list and push (dockerhub)
if: github.event_name != 'pull_request' if: github.event_name != 'pull_request'
@@ -120,11 +170,18 @@ jobs:
' <<< "$DOCKER_METADATA_OUTPUT_JSON") ' <<< "$DOCKER_METADATA_OUTPUT_JSON")
if [ -z "$tags" ]; then if [ -z "$tags" ]; then
echo "No dockerhub tags from docker/metadata-action; skipping dockerhub merge" echo "No dockerhub tags from docker/metadata-action; skipping dockerhub merge"
else exit 0
# shellcheck disable=SC2086
docker buildx imagetools create $tags \
$(printf 'localai/localai-backends@sha256:%s ' *)
fi fi
# shellcheck disable=SC2086
docker buildx imagetools create $tags \
$(printf 'localai/localai-backends@sha256:%s ' *)
first_tag=$(jq -cr '
.tags | map(select(startswith("localai/"))) | .[0]
' <<< "$DOCKER_METADATA_OUTPUT_JSON")
digest=$(docker buildx imagetools inspect "$first_tag" --format '{{.Manifest.Digest}}')
cosign sign --yes --recursive \
--registry-referrers-mode=oci-1-1 \
"localai/localai-backends@${digest}"
- name: Inspect manifest - name: Inspect manifest
if: github.event_name != 'pull_request' if: github.event_name != 'pull_request'

View File

@@ -151,7 +151,11 @@
ubuntu-codename: 'noble' ubuntu-codename: 'noble'
core-image-merge: core-image-merge:
if: github.repository == 'mudler/LocalAI' # !cancelled(): without it, GHA's default `needs:` cascade skips the
# merge whenever any matrix cell of the parent build fails or is
# cancelled. Same fix as backend.yml's merge jobs — we still want to
# publish the manifest list for tag-suffixes whose legs all succeeded.
if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
needs: core-image-build needs: core-image-build
uses: ./.github/workflows/image_merge.yml uses: ./.github/workflows/image_merge.yml
with: with:
@@ -164,7 +168,7 @@
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }} quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
gpu-vulkan-image-merge: gpu-vulkan-image-merge:
if: github.repository == 'mudler/LocalAI' if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
needs: core-image-build needs: core-image-build
uses: ./.github/workflows/image_merge.yml uses: ./.github/workflows/image_merge.yml
with: with:
@@ -175,7 +179,91 @@
dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }} dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }} quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }} quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
# Single-arch server-image merges. Same conceptual fix as the backend
# singletons in PR #9781: image_build.yml pushes by canonical digest
# only, so without a downstream merge step there's no tag for consumers
# (no :latest-gpu-nvidia-cuda-12, no :v<X>-gpu-nvidia-cuda-12, etc.).
# Each merge job needs only its parent build matrix and is filtered by
# tag-suffix in image_merge.yml's artifact-download pattern.
gpu-nvidia-cuda-12-image-merge:
if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
needs: core-image-build
uses: ./.github/workflows/image_merge.yml
with:
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-12'
secrets:
dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
gpu-nvidia-cuda-13-image-merge:
if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
needs: core-image-build
uses: ./.github/workflows/image_merge.yml
with:
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13'
secrets:
dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
gpu-intel-image-merge:
if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
needs: core-image-build
uses: ./.github/workflows/image_merge.yml
with:
tag-latest: 'auto'
tag-suffix: '-gpu-intel'
secrets:
dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
gpu-hipblas-image-merge:
if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
needs: hipblas-jobs
uses: ./.github/workflows/image_merge.yml
with:
tag-latest: 'auto'
tag-suffix: '-gpu-hipblas'
secrets:
dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
nvidia-l4t-arm64-image-merge:
if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
needs: gh-runner
uses: ./.github/workflows/image_merge.yml
with:
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-arm64'
secrets:
dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
nvidia-l4t-arm64-cuda-13-image-merge:
if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
needs: gh-runner
uses: ./.github/workflows/image_merge.yml
with:
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-arm64-cuda-13'
secrets:
dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
gh-runner: gh-runner:
if: github.repository == 'mudler/LocalAI' if: github.repository == 'mudler/LocalAI'
uses: ./.github/workflows/image_build.yml uses: ./.github/workflows/image_build.yml

View File

@@ -106,6 +106,7 @@ jobs:
type=ref,event=branch type=ref,event=branch
type=semver,pattern={{raw}} type=semver,pattern={{raw}}
type=sha type=sha
type=raw,value={{branch}}-{{date 'X'}}-{{sha}},enable={{is_default_branch}}
flavor: | flavor: |
latest=${{ inputs.tag-latest }} latest=${{ inputs.tag-latest }}
suffix=${{ inputs.tag-suffix }},onlatest=true suffix=${{ inputs.tag-suffix }},onlatest=true
@@ -185,11 +186,28 @@ jobs:
digest="${{ steps.build.outputs.digest }}" digest="${{ steps.build.outputs.digest }}"
touch "/tmp/digests/${digest#sha256:}" touch "/tmp/digests/${digest#sha256:}"
# See .github/scripts/anchor-digest-in-cache.sh for why this is needed
# and how it interacts with image_merge.yml's cleanup step. Mirrors the
# same anchor in backend_build.yml — quay's per-repo manifest GC reaps
# untagged manifests in local-ai before the merge runs.
- name: Anchor digest in ci-cache so quay GC won't reap before merge
if: github.event_name != 'pull_request'
env:
TAG_SUFFIX: ${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}
PLATFORM_TAG: ${{ inputs.platform-tag || 'single' }}
DIGEST: ${{ steps.build.outputs.digest }}
SOURCE_IMAGE: quay.io/go-skynet/local-ai
run: .github/scripts/anchor-digest-in-cache.sh
- name: Upload digest artifact - name: Upload digest artifact
if: github.event_name != 'pull_request' if: github.event_name != 'pull_request'
uses: actions/upload-artifact@v7 uses: actions/upload-artifact@v7
with: with:
name: digests-localai${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}-${{ inputs.platform-tag }} # `--` separator + 'single' placeholder for empty platform-tag
# same pattern as backend_build.yml. Prevents prefix collisions
# in the merge-side glob (e.g. -nvidia-l4t-arm64 is a prefix of
# -nvidia-l4t-arm64-cuda-13).
name: digests-localai${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}--${{ inputs.platform-tag || 'single' }}
path: /tmp/digests/* path: /tmp/digests/*
if-no-files-found: error if-no-files-found: error
retention-days: 1 retention-days: 1

View File

@@ -33,10 +33,22 @@ jobs:
env: env:
quay_username: ${{ secrets.quayUsername }} quay_username: ${{ secrets.quayUsername }}
steps: steps:
# Sparse checkout: needed for .github/scripts/ (the keepalive cleanup
# script). Skips the rest of the source tree.
- name: Checkout (.github/scripts only)
uses: actions/checkout@v6
with:
sparse-checkout: |
.github/scripts
sparse-checkout-cone-mode: false
- name: Download digests - name: Download digests
uses: actions/download-artifact@v8 uses: actions/download-artifact@v8
with: with:
pattern: digests-localai${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}-* # `--` separator anchors the glob so we don't over-match sibling
# tag-suffixes (e.g. -nvidia-l4t-arm64 vs -nvidia-l4t-arm64-cuda-13).
# Must stay in sync with image_build.yml's upload-artifact name.
pattern: digests-localai${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}--*
merge-multiple: true merge-multiple: true
path: /tmp/digests path: /tmp/digests
@@ -68,10 +80,18 @@ jobs:
type=ref,event=branch type=ref,event=branch
type=semver,pattern={{raw}} type=semver,pattern={{raw}}
type=sha type=sha
type=raw,value={{branch}}-{{date 'X'}}-{{sha}},enable={{is_default_branch}}
flavor: | flavor: |
latest=${{ inputs.tag-latest }} latest=${{ inputs.tag-latest }}
suffix=${{ inputs.tag-suffix }},onlatest=true suffix=${{ inputs.tag-suffix }},onlatest=true
# Source from ci-cache, not local-ai. See backend_merge.yml for the
# detailed rationale — quay's manifest GC is per-repository, so the
# untagged digest in local-ai gets reaped while the same content lives
# tagged under ci-cache (anchored by image_build.yml). buildx imagetools
# create copies the manifest into local-ai (blobs already cross-mounted)
# and publishes the manifest list with user-facing tags. End state in
# local-ai is self-contained; no embedded reference to ci-cache.
- name: Create manifest list and push (quay) - name: Create manifest list and push (quay)
working-directory: /tmp/digests working-directory: /tmp/digests
run: | run: |
@@ -82,7 +102,7 @@ jobs:
else else
# shellcheck disable=SC2086 # shellcheck disable=SC2086
docker buildx imagetools create $tags \ docker buildx imagetools create $tags \
$(printf 'quay.io/go-skynet/local-ai@sha256:%s ' *) $(printf 'quay.io/go-skynet/ci-cache@sha256:%s ' *)
fi fi
- name: Create manifest list and push (dockerhub) - name: Create manifest list and push (dockerhub)
@@ -107,6 +127,15 @@ jobs:
docker buildx imagetools inspect "$first_tag" docker buildx imagetools inspect "$first_tag"
fi fi
# See .github/scripts/cleanup-keepalive-tags.sh for the best-effort
# semantics — fails soft when the registry credential isn't OAuth-scoped.
- name: Cleanup keepalive tags in ci-cache
if: github.event_name != 'pull_request' && success()
env:
TAG_SUFFIX: ${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}
QUAY_TOKEN: ${{ secrets.quayPassword }}
run: .github/scripts/cleanup-keepalive-tags.sh
- name: Job summary - name: Job summary
run: | run: |
set -euo pipefail set -euo pipefail

View File

@@ -28,6 +28,7 @@ jobs:
qwen-asr: ${{ steps.detect.outputs.qwen-asr }} qwen-asr: ${{ steps.detect.outputs.qwen-asr }}
nemo: ${{ steps.detect.outputs.nemo }} nemo: ${{ steps.detect.outputs.nemo }}
voxcpm: ${{ steps.detect.outputs.voxcpm }} voxcpm: ${{ steps.detect.outputs.voxcpm }}
liquid-audio: ${{ steps.detect.outputs.liquid-audio }}
llama-cpp-quantization: ${{ steps.detect.outputs.llama-cpp-quantization }} llama-cpp-quantization: ${{ steps.detect.outputs.llama-cpp-quantization }}
llama-cpp: ${{ steps.detect.outputs.llama-cpp }} llama-cpp: ${{ steps.detect.outputs.llama-cpp }}
ik-llama-cpp: ${{ steps.detect.outputs.ik-llama-cpp }} ik-llama-cpp: ${{ steps.detect.outputs.ik-llama-cpp }}
@@ -447,6 +448,32 @@ jobs:
run: | run: |
make --jobs=5 --output-sync=target -C backend/python/voxcpm make --jobs=5 --output-sync=target -C backend/python/voxcpm
make --jobs=5 --output-sync=target -C backend/python/voxcpm test make --jobs=5 --output-sync=target -C backend/python/voxcpm test
# liquid-audio: LFM2.5-Audio any-to-any backend. The CI smoke test
# exercises Health() and LoadModel(mode:finetune) — fine-tune mode
# short-circuits before pulling weights (backend.py:192), so no
# HuggingFace download or GPU is needed. The full-inference path is
# gated on LIQUID_AUDIO_MODEL_ID, which we don't set here.
tests-liquid-audio:
needs: detect-changes
if: needs.detect-changes.outputs.liquid-audio == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Dependencies
run: |
sudo apt-get update
sudo apt-get install -y build-essential ffmpeg
sudo apt-get install -y ca-certificates cmake curl patch python3-pip
# Install UV
curl -LsSf https://astral.sh/uv/install.sh | sh
pip install --user --no-cache-dir grpcio-tools==1.64.1
- name: Test liquid-audio
run: |
make --jobs=5 --output-sync=target -C backend/python/liquid-audio
make --jobs=5 --output-sync=target -C backend/python/liquid-audio test
tests-llama-cpp-quantization: tests-llama-cpp-quantization:
needs: detect-changes needs: detect-changes
if: needs.detect-changes.outputs.llama-cpp-quantization == 'true' || needs.detect-changes.outputs.run-all == 'true' if: needs.detect-changes.outputs.llama-cpp-quantization == 'true' || needs.detect-changes.outputs.run-all == 'true'

3
.gitignore vendored
View File

@@ -77,3 +77,6 @@ local-backends/
tests/e2e-ui/ui-test-server tests/e2e-ui/ui-test-server
core/http/react-ui/playwright-report/ core/http/react-ui/playwright-report/
core/http/react-ui/test-results/ core/http/react-ui/test-results/
# Local worktrees
.worktrees/

View File

@@ -46,8 +46,52 @@ linters:
msg: 'LocalAI tests must use Ginkgo/Gomega; use Fail(...) instead of t.Fail. See .agents/coding-style.md.' msg: 'LocalAI tests must use Ginkgo/Gomega; use Fail(...) instead of t.Fail. See .agents/coding-style.md.'
- pattern: '^t\.FailNow$' - pattern: '^t\.FailNow$'
msg: 'LocalAI tests must use Ginkgo/Gomega; use Fail(...) instead of t.FailNow. See .agents/coding-style.md.' msg: 'LocalAI tests must use Ginkgo/Gomega; use Fail(...) instead of t.FailNow. See .agents/coding-style.md.'
# In-process config should flow through ApplicationConfig / kong-bound
# CLI flags, not via os.Getenv. The CLI layer is the legitimate
# env→struct boundary (kong's `env:"..."` tag); anything deeper that
# reads env directly leaks process state into business logic and
# makes flags impossible to test or override per-request. Backend
# subprocesses, the system/capabilities probe, and a few places that
# read non-LocalAI env vars (HOME, PATH, AUTH_TOKEN passed by parent)
# are exempt — see linters.exclusions.rules below.
- pattern: '^os\.(Getenv|LookupEnv|Environ)$'
msg: 'Plumb config through ApplicationConfig (or the relevant CLI struct) instead of reading env directly. CLI entry points (core/cli/) bind env vars via kong''s `env:` tag — that is the only sanctioned env→struct boundary. See .agents/coding-style.md.'
exclusions: exclusions:
paths: paths:
# Upstream whisper.cpp source tree fetched by the whisper backend Makefile. # Upstream whisper.cpp source tree fetched by the whisper backend Makefile.
- 'backend/go/whisper/sources' - 'backend/go/whisper/sources'
- 'docs/' - 'docs/'
rules:
# CLI entry points: kong's `env:"..."` tag is the legitimate env→struct
# boundary, and a handful of subcommands legitimately propagate values
# to spawned subprocesses (LLAMACPP_GRPC_SERVERS, MLX hostfile, ...).
- path: ^core/cli/
text: 'os\.(Getenv|LookupEnv|Environ)'
linters: [forbidigo]
# Backend subprocesses are independent binaries with their own env
# surface; they're not "in-process config" of the LocalAI server.
- path: ^backend/
text: 'os\.(Getenv|LookupEnv|Environ)'
linters: [forbidigo]
# System capability probe reads HOME, PATH-style vars to discover
# GPUs, default paths, etc. — not LocalAI config.
- path: ^pkg/system/
text: 'os\.(Getenv|LookupEnv|Environ)'
linters: [forbidigo]
# gRPC server reads AUTH_TOKEN passed in by the parent process at spawn
# time; model.Loader sets/inherits env to communicate with subprocesses.
- path: ^pkg/grpc/
text: 'os\.(Getenv|LookupEnv|Environ)'
linters: [forbidigo]
- path: ^pkg/model/
text: 'os\.(Getenv|LookupEnv|Environ)'
linters: [forbidigo]
# Top-level main binaries (local-ai, launcher) are entry points.
- path: ^cmd/
text: 'os\.(Getenv|LookupEnv|Environ)'
linters: [forbidigo]
# Tests legitimately read $HOME, $TMPDIR, and gating env vars
# (LOCALAI_COSIGN_LIVE, etc.) to skip live-network specs.
- path: _test\.go$
text: 'os\.(Getenv|LookupEnv|Environ)'
linters: [forbidigo]

View File

@@ -31,6 +31,7 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
| [.agents/debugging-backends.md](.agents/debugging-backends.md) | Debugging runtime backend failures, dependency conflicts, rebuilding backends | | [.agents/debugging-backends.md](.agents/debugging-backends.md) | Debugging runtime backend failures, dependency conflicts, rebuilding backends |
| [.agents/adding-gallery-models.md](.agents/adding-gallery-models.md) | Adding GGUF models from HuggingFace to the model gallery | | [.agents/adding-gallery-models.md](.agents/adding-gallery-models.md) | Adding GGUF models from HuggingFace to the model gallery |
| [.agents/localai-assistant-mcp.md](.agents/localai-assistant-mcp.md) | LocalAI Assistant chat modality — adding admin tools to the in-process MCP server, editing skill prompts, keeping REST + MCP + skills in sync | | [.agents/localai-assistant-mcp.md](.agents/localai-assistant-mcp.md) | LocalAI Assistant chat modality — adding admin tools to the in-process MCP server, editing skill prompts, keeping REST + MCP + skills in sync |
| [.agents/backend-signing.md](.agents/backend-signing.md) | Backend OCI image signing (keyless cosign + sigstore-go) — producer-side CI setup, consumer-side gallery `verification:` block, strict mode (`LOCALAI_REQUIRE_BACKEND_INTEGRITY`), revocation via `not_before` |
## Quick Reference ## Quick Reference

View File

@@ -1,5 +1,5 @@
# Disable parallel execution for backend builds # Disable parallel execution for backend builds
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin .NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio
GOCMD=go GOCMD=go
GOTEST=$(GOCMD) test GOTEST=$(GOCMD) test
@@ -463,6 +463,7 @@ prepare-test-extra: protogen-python
$(MAKE) -C backend/python/vllm-omni $(MAKE) -C backend/python/vllm-omni
$(MAKE) -C backend/python/sglang $(MAKE) -C backend/python/sglang
$(MAKE) -C backend/python/vibevoice $(MAKE) -C backend/python/vibevoice
$(MAKE) -C backend/python/liquid-audio
$(MAKE) -C backend/python/moonshine $(MAKE) -C backend/python/moonshine
$(MAKE) -C backend/python/pocket-tts $(MAKE) -C backend/python/pocket-tts
$(MAKE) -C backend/python/qwen-tts $(MAKE) -C backend/python/qwen-tts
@@ -488,6 +489,7 @@ test-extra: prepare-test-extra
$(MAKE) -C backend/python/vllm test $(MAKE) -C backend/python/vllm test
$(MAKE) -C backend/python/vllm-omni test $(MAKE) -C backend/python/vllm-omni test
$(MAKE) -C backend/python/vibevoice test $(MAKE) -C backend/python/vibevoice test
$(MAKE) -C backend/python/liquid-audio test
$(MAKE) -C backend/python/moonshine test $(MAKE) -C backend/python/moonshine test
$(MAKE) -C backend/python/pocket-tts test $(MAKE) -C backend/python/pocket-tts test
$(MAKE) -C backend/python/qwen-tts test $(MAKE) -C backend/python/qwen-tts test
@@ -1092,6 +1094,7 @@ BACKEND_SGLANG = sglang|python|.|false|true
BACKEND_DIFFUSERS = diffusers|python|.|--progress=plain|true BACKEND_DIFFUSERS = diffusers|python|.|--progress=plain|true
BACKEND_CHATTERBOX = chatterbox|python|.|false|true BACKEND_CHATTERBOX = chatterbox|python|.|false|true
BACKEND_VIBEVOICE = vibevoice|python|.|--progress=plain|true BACKEND_VIBEVOICE = vibevoice|python|.|--progress=plain|true
BACKEND_LIQUID_AUDIO = liquid-audio|python|.|--progress=plain|true
BACKEND_MOONSHINE = moonshine|python|.|false|true BACKEND_MOONSHINE = moonshine|python|.|false|true
BACKEND_POCKET_TTS = pocket-tts|python|.|false|true BACKEND_POCKET_TTS = pocket-tts|python|.|false|true
BACKEND_QWEN_TTS = qwen-tts|python|.|false|true BACKEND_QWEN_TTS = qwen-tts|python|.|false|true
@@ -1169,6 +1172,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_SGLANG)))
$(eval $(call generate-docker-build-target,$(BACKEND_DIFFUSERS))) $(eval $(call generate-docker-build-target,$(BACKEND_DIFFUSERS)))
$(eval $(call generate-docker-build-target,$(BACKEND_CHATTERBOX))) $(eval $(call generate-docker-build-target,$(BACKEND_CHATTERBOX)))
$(eval $(call generate-docker-build-target,$(BACKEND_VIBEVOICE))) $(eval $(call generate-docker-build-target,$(BACKEND_VIBEVOICE)))
$(eval $(call generate-docker-build-target,$(BACKEND_LIQUID_AUDIO)))
$(eval $(call generate-docker-build-target,$(BACKEND_MOONSHINE))) $(eval $(call generate-docker-build-target,$(BACKEND_MOONSHINE)))
$(eval $(call generate-docker-build-target,$(BACKEND_POCKET_TTS))) $(eval $(call generate-docker-build-target,$(BACKEND_POCKET_TTS)))
$(eval $(call generate-docker-build-target,$(BACKEND_QWEN_TTS))) $(eval $(call generate-docker-build-target,$(BACKEND_QWEN_TTS)))
@@ -1197,7 +1201,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_SHERPA_ONNX)))
docker-save-%: backend-images docker-save-%: backend-images
docker save local-ai-backend:$* -o backend-images/$*.tar docker save local-ai-backend:$* -o backend-images/$*.tar
docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx
######################################################## ########################################################
### Mock Backend for E2E Tests ### Mock Backend for E2E Tests

View File

@@ -48,6 +48,11 @@ service Backend {
rpc AudioTransform(AudioTransformRequest) returns (AudioTransformResult) {} rpc AudioTransform(AudioTransformRequest) returns (AudioTransformResult) {}
rpc AudioTransformStream(stream AudioTransformFrameRequest) returns (stream AudioTransformFrameResponse) {} rpc AudioTransformStream(stream AudioTransformFrameRequest) returns (stream AudioTransformFrameResponse) {}
// AudioToAudioStream is the bidirectional any-to-any S2S RPC. Backends
// that load a speech-to-speech model consume input audio frames and emit
// interleaved audio + transcript + tool-call deltas as typed events.
// Backends without S2S support return UNIMPLEMENTED.
rpc AudioToAudioStream(stream AudioToAudioRequest) returns (stream AudioToAudioResponse) {}
rpc ModelMetadata(ModelOptions) returns (ModelMetadataResponse) {} rpc ModelMetadata(ModelOptions) returns (ModelMetadataResponse) {}
@@ -768,6 +773,93 @@ message AudioTransformFrameResponse {
int64 frame_index = 2; int64 frame_index = 2;
} }
// === AudioToAudioStream messages =========================================
//
// Bidirectional stream between the LocalAI core and an any-to-any audio
// model. The client opens the stream with a Config payload, then alternates
// Frame (input audio) and Control (turn boundaries, function-call results,
// session updates) payloads. The server streams back typed events: audio
// frames carry PCM in `pcm`; transcript / tool-call deltas carry JSON in
// `meta`; the stream ends with a `response.done` (success) or `error` event.
message AudioToAudioRequest {
oneof payload {
AudioToAudioConfig config = 1;
AudioToAudioFrame frame = 2;
AudioToAudioControl control = 3;
}
}
message AudioToAudioConfig {
// PCM format for client→server audio. 0 => backend default
// (16 kHz for the LFM2-Audio Conformer encoder).
int32 input_sample_rate = 1;
// Preferred server→client audio rate. 0 => backend default
// (24 kHz for the LFM2-Audio vocoder).
int32 output_sample_rate = 2;
// Optional system prompt override. Empty => backend chooses based on
// mode (e.g. "Respond with interleaved text and audio.").
string system_prompt = 3;
// Optional baked-voice id. Models that only ship a fixed set of
// voices (e.g. LFM2-Audio: us_male/us_female/uk_male/uk_female) match
// this against their voice table; an empty string keeps the default.
string voice = 4;
// JSON-encoded array of tool definitions in OpenAI Chat Completions
// format. Empty => no tools.
string tools = 5;
// Free-form sampling / decoding parameters (temperature, top_k,
// max_new_tokens, audio_top_k, etc).
map<string, string> params = 6;
// True => reset any session-scoped state before processing further
// frames on this stream. The first Config implicitly resets.
bool reset = 7;
}
message AudioToAudioFrame {
// Raw PCM s16le mono at config.input_sample_rate. Empty pcm + end_of_input
// is a valid "user finished speaking" marker without trailing audio.
bytes pcm = 1;
// Marks the last frame of a user turn. The backend may begin emitting
// a response immediately after seeing this.
bool end_of_input = 2;
}
message AudioToAudioControl {
// Free-form control event names. Initial set:
// "input_audio_buffer.commit" — user finished speaking
// "response.cancel" — abort in-flight generation
// "conversation.item.create" — inject a non-audio item (e.g.
// function_call_output as JSON in
// `payload`)
// "session.update" — re-configure mid-stream
string event = 1;
// Event-specific JSON payload.
bytes payload = 2;
}
message AudioToAudioResponse {
// Event identifies what this frame carries. Mirrors the OpenAI Realtime
// API server-event names where applicable. Initial set:
// "response.audio.delta"
// "response.audio_transcript.delta"
// "response.function_call_arguments.delta"
// "response.function_call_arguments.done"
// "response.done"
// "error"
string event = 1;
// Populated when event = response.audio.delta.
bytes pcm = 2;
// Populated alongside pcm to identify its rate. 0 => same as the
// session's negotiated output_sample_rate.
int32 sample_rate = 3;
// JSON payload for non-PCM events (transcript chunk, tool args, error
// body).
bytes meta = 4;
// Monotonic per-stream counter, useful for client reordering and
// debugging.
int64 sequence = 5;
}
message ModelMetadataResponse { message ModelMetadataResponse {
bool supports_thinking = 1; bool supports_thinking = 1;
string rendered_template = 2; // The rendered chat template with enable_thinking=true (empty if not applicable) string rendered_template = 2; // The rendered chat template with enable_thinking=true (empty if not applicable)

View File

@@ -1,10 +1,10 @@
# ds4 backend Makefile. # ds4 backend Makefile.
# #
# Upstream pin lives below as DS4_VERSION?= so the bump-deps bot # Upstream pin lives below as DS4_VERSION?=8d576642c39b9a2d782a80159ba84ef5a81c0b81
# (.github/bump_deps.sh) can find and update it - matches the # (.github/bump_deps.sh) can find and update it - matches the
# llama-cpp / ik-llama-cpp / turboquant convention. # llama-cpp / ik-llama-cpp / turboquant convention.
DS4_VERSION?=ae302c2fa18cc6d9aefc021d0f27ae03c9ad2fc0 DS4_VERSION?=8d576642c39b9a2d782a80159ba84ef5a81c0b81
DS4_REPO?=https://github.com/antirez/ds4 DS4_REPO?=https://github.com/antirez/ds4
CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST)))) CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))

View File

@@ -1,5 +1,5 @@
IK_LLAMA_VERSION?=eb570eb96689c235933b813693ca28ab9d3d26de IK_LLAMA_VERSION?=b3d39cff8bffbd67296d6badd4076a1486a0715c
LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp
CMAKE_ARGS?= CMAKE_ARGS?=

View File

@@ -1,5 +1,5 @@
LLAMA_VERSION?=1ec7ba0c14f33f17e980daeeda5f35b225d41994 LLAMA_VERSION?=1acee6bf8939948f9bcbf4b14034e4b475f06069
LLAMA_REPO?=https://github.com/ggerganov/llama.cpp LLAMA_REPO?=https://github.com/ggerganov/llama.cpp
CMAKE_ARGS?= CMAKE_ARGS?=

View File

@@ -32,6 +32,7 @@
#include <grpcpp/health_check_service_interface.h> #include <grpcpp/health_check_service_interface.h>
#include <grpcpp/security/server_credentials.h> #include <grpcpp/security/server_credentials.h>
#include <regex> #include <regex>
#include <algorithm>
#include <atomic> #include <atomic>
#include <cstdlib> #include <cstdlib>
#include <fstream> #include <fstream>
@@ -450,6 +451,8 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
// vector; the turboquant fork still uses the legacy scalar. The // vector; the turboquant fork still uses the legacy scalar. The
// LOCALAI_LEGACY_LLAMA_CPP_SPEC macro is injected by // LOCALAI_LEGACY_LLAMA_CPP_SPEC macro is injected by
// backend/cpp/turboquant/patch-grpc-server.sh for fork builds only. // backend/cpp/turboquant/patch-grpc-server.sh for fork builds only.
// Upstream renamed COMMON_SPECULATIVE_TYPE_DRAFT -> ..._DRAFT_SIMPLE
// in ggml-org/llama.cpp#22964; the fork still uses the old name.
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC #ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
if (params.speculative.type == COMMON_SPECULATIVE_TYPE_NONE) { if (params.speculative.type == COMMON_SPECULATIVE_TYPE_NONE) {
params.speculative.type = COMMON_SPECULATIVE_TYPE_DRAFT; params.speculative.type = COMMON_SPECULATIVE_TYPE_DRAFT;
@@ -458,7 +461,7 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
const bool no_spec_type = params.speculative.types.empty() || const bool no_spec_type = params.speculative.types.empty() ||
(params.speculative.types.size() == 1 && params.speculative.types[0] == COMMON_SPECULATIVE_TYPE_NONE); (params.speculative.types.size() == 1 && params.speculative.types[0] == COMMON_SPECULATIVE_TYPE_NONE);
if (no_spec_type) { if (no_spec_type) {
params.speculative.types = { COMMON_SPECULATIVE_TYPE_DRAFT }; params.speculative.types = { COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE };
} }
#endif #endif
} }
@@ -514,16 +517,27 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
params.warmup = true; params.warmup = true;
// no_op_offload: disable host tensor op offload (default: false) // no_op_offload: disable host tensor op offload (default: false)
params.no_op_offload = false; params.no_op_offload = false;
// kv_unified: enable unified KV cache (default: false) // kv_unified: enable unified KV cache. Upstream's server auto-enables this
params.kv_unified = false; // when the slot count is auto (-np <0), bumping n_parallel to 4 alongside.
// n_ctx_checkpoints: max context checkpoints per slot (default: 8) // LocalAI keeps n_parallel=1 by default, which would skip that auto path
params.n_ctx_checkpoints = 8; // and leave kv_unified=false. We flip the default to true here so the
// server-side prompt cache (cache_idle_slots) is actually usable on the
// llama memory fit fails if we don't provide a buffer for tensor overrides // single-slot path that LocalAI ships with: without it, idle slots are
const size_t ntbo = llama_max_tensor_buft_overrides(); // never persisted across requests and the prompt cache is dead weight.
while (params.tensor_buft_overrides.size() < ntbo) { // Users can opt out with `options: [ "kv_unified:false" ]`.
params.tensor_buft_overrides.push_back({nullptr, nullptr}); params.kv_unified = true;
} // n_ctx_checkpoints: max context checkpoints per slot. Match upstream's
// default (32); the previous LocalAI-specific 8 was unnecessarily tight
// and limits partial-prefix recovery without a clear memory rationale.
params.n_ctx_checkpoints = 32;
// cache_idle_slots: save and clear idle slot KV to the prompt cache on
// task switch. Upstream default is true; the server auto-disables it if
// kv_unified=false or cache_ram_mib=0, so flipping kv_unified above is
// what actually unlocks it.
params.cache_idle_slots = true;
// checkpoint_every_nt: create a context checkpoint every N tokens during
// prefill (-1 disables). Match upstream's default (8192).
params.checkpoint_every_nt = 8192;
// decode options. Options are in form optname:optvale, or if booleans only optname. // decode options. Options are in form optname:optvale, or if booleans only optname.
for (int i = 0; i < request->options_size(); i++) { for (int i = 0; i < request->options_size(); i++) {
@@ -682,9 +696,161 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
try { try {
params.n_ctx_checkpoints = std::stoi(optval_str); params.n_ctx_checkpoints = std::stoi(optval_str);
} catch (const std::exception& e) { } catch (const std::exception& e) {
// If conversion fails, keep default value (8) // If conversion fails, keep default value (32)
} }
} }
// --- server-side idle-slot prompt cache toggle (upstream --cache-idle-slots) ---
// Saves the slot's KV state into the host-side prompt cache on task
// switch so a later request with the same prefix can warm-load it.
// Auto-disabled by the server if kv_unified=false or cache_ram=0.
} else if (!strcmp(optname, "cache_idle_slots") || !strcmp(optname, "idle_slots_cache")) {
if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
params.cache_idle_slots = true;
} else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
params.cache_idle_slots = false;
}
// --- prefill checkpoint cadence (upstream -cpent / --checkpoint-every-n-tokens) ---
// -1 disables checkpointing during prefill.
} else if (!strcmp(optname, "checkpoint_every_nt") || !strcmp(optname, "checkpoint_every_n_tokens")) {
if (optval != NULL) {
try {
params.checkpoint_every_nt = std::stoi(optval_str);
} catch (const std::exception& e) {
// If conversion fails, keep default value (8192)
}
}
// --- physical batch size (upstream -ub / --ubatch-size) ---
// Note: line ~482 already aliases n_ubatch to n_batch as a default; this
// option lets users decouple the two (useful for embeddings/rerank).
} else if (!strcmp(optname, "n_ubatch") || !strcmp(optname, "ubatch")) {
if (optval != NULL) {
try { params.n_ubatch = std::stoi(optval_str); } catch (...) {}
}
// --- main-model batch threads (upstream -tb / --threads-batch) ---
} else if (!strcmp(optname, "threads_batch") || !strcmp(optname, "n_threads_batch")) {
if (optval != NULL) {
try {
int n = std::stoi(optval_str);
if (n <= 0) n = (int)std::thread::hardware_concurrency();
params.cpuparams_batch.n_threads = n;
} catch (...) {}
}
// --- pooling type for embeddings (upstream --pooling) ---
} else if (!strcmp(optname, "pooling_type") || !strcmp(optname, "pooling")) {
if (optval != NULL) {
if (optval_str == "none") params.pooling_type = LLAMA_POOLING_TYPE_NONE;
else if (optval_str == "mean") params.pooling_type = LLAMA_POOLING_TYPE_MEAN;
else if (optval_str == "cls") params.pooling_type = LLAMA_POOLING_TYPE_CLS;
else if (optval_str == "last") params.pooling_type = LLAMA_POOLING_TYPE_LAST;
else if (optval_str == "rank") params.pooling_type = LLAMA_POOLING_TYPE_RANK;
// unknown values silently leave UNSPECIFIED (auto-detect)
}
// --- llama log verbosity threshold (upstream -lv / --verbosity) ---
} else if (!strcmp(optname, "verbosity")) {
if (optval != NULL) {
try { params.verbosity = std::stoi(optval_str); } catch (...) {}
}
// --- O_DIRECT model loading (upstream --direct-io) ---
} else if (!strcmp(optname, "direct_io") || !strcmp(optname, "use_direct_io")) {
if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
params.use_direct_io = true;
} else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
params.use_direct_io = false;
}
// --- embedding normalization (upstream --embd-normalize) ---
// -1 none, 0 max-abs, 1 taxicab, 2 L2 (default), >2 p-norm
} else if (!strcmp(optname, "embd_normalize") || !strcmp(optname, "embedding_normalize")) {
if (optval != NULL) {
try { params.embd_normalize = std::stoi(optval_str); } catch (...) {}
}
// --- reasoning parser (upstream --reasoning-format) ---
// Picks the parser for <think> blocks emitted by reasoning models.
// none / auto / deepseek / deepseek-legacy
} else if (!strcmp(optname, "reasoning_format")) {
if (optval != NULL) {
if (optval_str == "none") params.reasoning_format = COMMON_REASONING_FORMAT_NONE;
else if (optval_str == "auto") params.reasoning_format = COMMON_REASONING_FORMAT_AUTO;
else if (optval_str == "deepseek") params.reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK;
else if (optval_str == "deepseek-legacy" || optval_str == "deepseek_legacy")
params.reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK_LEGACY;
// unknown values silently keep the upstream default (DEEPSEEK)
}
// --- reasoning budget (upstream --reasoning-budget) ---
// -1 unlimited, 0 disabled, >0 token budget for thinking blocks.
// Distinct from per-request `enable_thinking` (chat_template_kwargs).
} else if (!strcmp(optname, "enable_reasoning") || !strcmp(optname, "reasoning_budget")) {
if (optval != NULL) {
try { params.enable_reasoning = std::stoi(optval_str); } catch (...) {}
}
// --- prefill assistant turn (upstream --no-prefill-assistant) ---
} else if (!strcmp(optname, "prefill_assistant")) {
if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
params.prefill_assistant = true;
} else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
params.prefill_assistant = false;
}
// --- mmproj GPU offload (upstream --no-mmproj-offload, inverted) ---
} else if (!strcmp(optname, "mmproj_use_gpu") || !strcmp(optname, "mmproj_offload")) {
if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
params.mmproj_use_gpu = true;
} else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
params.mmproj_use_gpu = false;
}
// --- per-image vision token budget (upstream --image-min/max-tokens) ---
} else if (!strcmp(optname, "image_min_tokens")) {
if (optval != NULL) {
try { params.image_min_tokens = std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "image_max_tokens")) {
if (optval != NULL) {
try { params.image_max_tokens = std::stoi(optval_str); } catch (...) {}
}
// --- main-model tensor buffer overrides (upstream --override-tensor) ---
// Format: <tensor regex>=<buffer type>,<tensor regex>=<buffer type>,...
// Mirrors the existing `draft_override_tensor` parser below.
} else if (!strcmp(optname, "override_tensor") || !strcmp(optname, "tensor_buft_overrides")) {
ggml_backend_load_all();
std::map<std::string, ggml_backend_buffer_type_t> buft_list;
for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
auto * dev = ggml_backend_dev_get(i);
auto * buft = ggml_backend_dev_buffer_type(dev);
if (buft) {
buft_list[ggml_backend_buft_name(buft)] = buft;
}
}
static std::list<std::string> override_names;
std::string cur;
auto flush = [&](const std::string & spec) {
auto pos = spec.find('=');
if (pos == std::string::npos) return;
const std::string name = spec.substr(0, pos);
const std::string type = spec.substr(pos + 1);
auto it = buft_list.find(type);
if (it == buft_list.end()) return; // unknown buffer type: ignore
override_names.push_back(name);
params.tensor_buft_overrides.push_back(
{override_names.back().c_str(), it->second});
};
for (char c : optval_str) {
if (c == ',') { if (!cur.empty()) { flush(cur); cur.clear(); } }
else { cur.push_back(c); }
}
if (!cur.empty()) flush(cur);
// Speculative decoding options // Speculative decoding options
} else if (!strcmp(optname, "spec_type") || !strcmp(optname, "speculative_type")) { } else if (!strcmp(optname, "spec_type") || !strcmp(optname, "speculative_type")) {
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC #ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
@@ -701,16 +867,27 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
// Upstream switched to a vector of types (comma-separated for multi-type // Upstream switched to a vector of types (comma-separated for multi-type
// chaining via common_speculative_types_from_names). We keep accepting a // chaining via common_speculative_types_from_names). We keep accepting a
// single value here, but also tolerate comma-separated lists. // single value here, but also tolerate comma-separated lists.
//
// ggml-org/llama.cpp#22964 also renamed the registered names from
// underscore- to dash-separated form, and replaced the bare
// `draft`/`eagle3` aliases with `draft-simple`/`draft-eagle3`. We
// normalize each token here so existing model configs keep working.
auto normalize_spec_name = [](std::string s) -> std::string {
std::replace(s.begin(), s.end(), '_', '-');
if (s == "draft") return "draft-simple";
if (s == "eagle3") return "draft-eagle3";
return s;
};
std::vector<std::string> names; std::vector<std::string> names;
std::string item; std::string item;
for (char c : optval_str) { for (char c : optval_str) {
if (c == ',') { if (c == ',') {
if (!item.empty()) { names.push_back(item); item.clear(); } if (!item.empty()) { names.push_back(normalize_spec_name(item)); item.clear(); }
} else { } else {
item.push_back(c); item.push_back(c);
} }
} }
if (!item.empty()) names.push_back(item); if (!item.empty()) names.push_back(normalize_spec_name(item));
auto parsed = common_speculative_types_from_names(names); auto parsed = common_speculative_types_from_names(names);
if (!parsed.empty()) { if (!parsed.empty()) {
params.speculative.types = parsed; params.speculative.types = parsed;
@@ -937,6 +1114,20 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
params.kv_overrides.back().key[0] = 0; params.kv_overrides.back().key[0] = 0;
} }
// tensor_buft_overrides sentinel termination (mirrors upstream common/arg.cpp).
// Real entries are pushed during option parsing; here we pad/terminate so the
// model loader sees back().pattern == nullptr (GGML_ASSERT at common.cpp:1543)
// and so llama_params_fit has the placeholder slots it requires.
{
const size_t ntbo = llama_max_tensor_buft_overrides();
while (params.tensor_buft_overrides.size() < ntbo) {
params.tensor_buft_overrides.push_back({nullptr, nullptr});
}
}
if (!params.speculative.draft.tensor_buft_overrides.empty()) {
params.speculative.draft.tensor_buft_overrides.push_back({nullptr, nullptr});
}
// TODO: Add yarn // TODO: Add yarn
if (!request->tensorsplit().empty()) { if (!request->tensorsplit().empty()) {
@@ -2794,7 +2985,9 @@ public:
} }
} }
int embd_normalize = 2; // default to Euclidean/L2 norm // Honor the load-time embd_normalize set via options:embd_normalize.
// -1 none, 0 max-abs, 1 taxicab, 2 L2 (default), >2 p-norm.
int embd_normalize = params_base.embd_normalize;
// create and queue the task // create and queue the task
auto rd = ctx_server.get_response_reader(); auto rd = ctx_server.get_response_reader();
{ {

View File

@@ -1,7 +1,7 @@
# Pinned to the HEAD of feature/turboquant-kv-cache on https://github.com/TheTom/llama-cpp-turboquant. # Pinned to the HEAD of feature/turboquant-kv-cache on https://github.com/TheTom/llama-cpp-turboquant.
# Auto-bumped nightly by .github/workflows/bump_deps.yaml. # Auto-bumped nightly by .github/workflows/bump_deps.yaml.
TURBOQUANT_VERSION?=69d8e4be47243e83b3d0d71e932bc7aa61c644dc TURBOQUANT_VERSION?=5aeb2fdbe26cd4c534c6fa15de73cb5749bd0403
LLAMA_REPO?=https://github.com/TheTom/llama-cpp-turboquant LLAMA_REPO?=https://github.com/TheTom/llama-cpp-turboquant
CMAKE_ARGS?= CMAKE_ARGS?=

View File

@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
# acestep.cpp version # acestep.cpp version
ACESTEP_REPO?=https://github.com/ace-step/acestep.cpp ACESTEP_REPO?=https://github.com/ace-step/acestep.cpp
ACESTEP_CPP_VERSION?=e0c8d75a672fca5684c88c68dbf6d12f58754258 ACESTEP_CPP_VERSION?=ed53caf164e4492a5620b2e3f2264629cf66da24
SO_TARGET?=libgoacestepcpp.so SO_TARGET?=libgoacestepcpp.so
CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF

View File

@@ -22,12 +22,11 @@
#include <vector> #include <vector>
// Global model contexts (loaded once, reused across requests) // Global model contexts (loaded once, reused across requests)
static DiTGGML g_dit = {}; static DiTGGML g_dit = {};
static DiTGGMLConfig g_dit_cfg; static VAEGGML g_vae = {};
static VAEGGML g_vae = {}; static bool g_dit_loaded = false;
static bool g_dit_loaded = false; static bool g_vae_loaded = false;
static bool g_vae_loaded = false; static bool g_is_turbo = false;
static bool g_is_turbo = false;
// Silence latent [15000, 64] — read once from DiT GGUF // Silence latent [15000, 64] — read once from DiT GGUF
static std::vector<float> g_silence_full; static std::vector<float> g_silence_full;
@@ -72,10 +71,9 @@ int load_model(const char * lm_model_path, const char * text_encoder_path,
g_text_enc_path = text_encoder_path; g_text_enc_path = text_encoder_path;
g_dit_path = dit_model_path; g_dit_path = dit_model_path;
// Load DiT model // Load DiT model (backend init + config are handled inside dit_ggml_load)
fprintf(stderr, "[acestep-cpp] Loading DiT from %s\n", dit_model_path); fprintf(stderr, "[acestep-cpp] Loading DiT from %s\n", dit_model_path);
dit_ggml_init_backend(&g_dit); if (!dit_ggml_load(&g_dit, dit_model_path)) {
if (!dit_ggml_load(&g_dit, dit_model_path, g_dit_cfg, nullptr, 0.0f)) {
fprintf(stderr, "[acestep-cpp] FATAL: failed to load DiT from %s\n", dit_model_path); fprintf(stderr, "[acestep-cpp] FATAL: failed to load DiT from %s\n", dit_model_path);
return 1; return 1;
} }
@@ -149,16 +147,16 @@ int generate_music(const char * caption, const char * lyrics, int bpm,
// Compute T (latent frames at 25Hz) // Compute T (latent frames at 25Hz)
int T = (int)(duration * FRAMES_PER_SECOND); int T = (int)(duration * FRAMES_PER_SECOND);
T = ((T + g_dit_cfg.patch_size - 1) / g_dit_cfg.patch_size) * g_dit_cfg.patch_size; T = ((T + g_dit.cfg.patch_size - 1) / g_dit.cfg.patch_size) * g_dit.cfg.patch_size;
int S = T / g_dit_cfg.patch_size; int S = T / g_dit.cfg.patch_size;
if (T > 15000) { if (T > 15000) {
fprintf(stderr, "[acestep-cpp] ERROR: T=%d exceeds max 15000\n", T); fprintf(stderr, "[acestep-cpp] ERROR: T=%d exceeds max 15000\n", T);
return 2; return 2;
} }
int Oc = g_dit_cfg.out_channels; // 64 int Oc = g_dit.cfg.out_channels; // 64
int ctx_ch = g_dit_cfg.in_channels - Oc; // 128 int ctx_ch = g_dit.cfg.in_channels - Oc; // 128
fprintf(stderr, "[acestep-cpp] T=%d, S=%d, duration=%.1fs, seed=%d\n", T, S, duration, seed); fprintf(stderr, "[acestep-cpp] T=%d, S=%d, duration=%.1fs, seed=%d\n", T, S, duration, seed);
@@ -191,9 +189,8 @@ int generate_music(const char * caption, const char * lyrics, int bpm,
fprintf(stderr, "[acestep-cpp] caption: %d tokens, lyrics: %d tokens\n", S_text, S_lyric); fprintf(stderr, "[acestep-cpp] caption: %d tokens, lyrics: %d tokens\n", S_text, S_lyric);
// 4. Text encoder forward // 4. Text encoder forward (backend init handled inside qwen3_load_text_encoder)
Qwen3GGML text_enc = {}; Qwen3GGML text_enc = {};
qwen3_init_backend(&text_enc);
if (!qwen3_load_text_encoder(&text_enc, g_text_enc_path.c_str())) { if (!qwen3_load_text_encoder(&text_enc, g_text_enc_path.c_str())) {
fprintf(stderr, "[acestep-cpp] FATAL: failed to load text encoder\n"); fprintf(stderr, "[acestep-cpp] FATAL: failed to load text encoder\n");
return 4; return 4;
@@ -209,9 +206,8 @@ int generate_music(const char * caption, const char * lyrics, int bpm,
std::vector<float> lyric_embed(H_text * S_lyric); std::vector<float> lyric_embed(H_text * S_lyric);
qwen3_embed_lookup(&text_enc, lyric_ids.data(), S_lyric, lyric_embed.data()); qwen3_embed_lookup(&text_enc, lyric_ids.data(), S_lyric, lyric_embed.data());
// 6. Condition encoder // 6. Condition encoder (backend init handled inside cond_ggml_load)
CondGGML cond = {}; CondGGML cond = {};
cond_ggml_init_backend(&cond);
if (!cond_ggml_load(&cond, g_dit_path.c_str())) { if (!cond_ggml_load(&cond, g_dit_path.c_str())) {
fprintf(stderr, "[acestep-cpp] FATAL: failed to load condition encoder\n"); fprintf(stderr, "[acestep-cpp] FATAL: failed to load condition encoder\n");
qwen3_free(&text_enc); qwen3_free(&text_enc);

View File

@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
# stablediffusion.cpp (ggml) # stablediffusion.cpp (ggml)
STABLEDIFFUSION_GGML_REPO?=https://github.com/leejet/stable-diffusion.cpp STABLEDIFFUSION_GGML_REPO?=https://github.com/leejet/stable-diffusion.cpp
STABLEDIFFUSION_GGML_VERSION?=90e87bc846f17059771efb8aaa31e9ef0cab6f78 STABLEDIFFUSION_GGML_VERSION?=0baf721215f45335a5df8caf0ecb34e870c956e7
CMAKE_ARGS+=-DGGML_MAX_NAME=128 CMAKE_ARGS+=-DGGML_MAX_NAME=128

View File

@@ -1188,6 +1188,9 @@ int gen_video(sd_vid_gen_params_t *p, int steps, char *dst, float cfg_scale, int
p->high_noise_sample_params.scheduler = scheduler; p->high_noise_sample_params.scheduler = scheduler;
p->high_noise_sample_params.flow_shift = flow_shift; p->high_noise_sample_params.flow_shift = flow_shift;
// Pin output fps in params; upstream uses it for audio sync (and we also mux at this rate).
p->fps = fps;
// Load init/end reference images if provided (resized to output dims). // Load init/end reference images if provided (resized to output dims).
uint8_t* init_buf = nullptr; uint8_t* init_buf = nullptr;
uint8_t* end_buf = nullptr; uint8_t* end_buf = nullptr;
@@ -1206,11 +1209,14 @@ int gen_video(sd_vid_gen_params_t *p, int steps, char *dst, float cfg_scale, int
// Generate // Generate
int num_frames_out = 0; int num_frames_out = 0;
sd_image_t* frames = generate_video(sd_c, p, &num_frames_out); sd_image_t* frames = nullptr;
sd_audio_t* audio = nullptr;
bool ok = generate_video(sd_c, p, &frames, &num_frames_out, &audio);
std::free(p); std::free(p);
if (!frames || num_frames_out == 0) { if (!ok || !frames || num_frames_out == 0) {
fprintf(stderr, "generate_video produced no frames\n"); fprintf(stderr, "generate_video produced no frames\n");
if (audio) free_sd_audio(audio);
if (init_buf) free(init_buf); if (init_buf) free(init_buf);
if (end_buf) free(end_buf); if (end_buf) free(end_buf);
return 1; return 1;
@@ -1224,6 +1230,7 @@ int gen_video(sd_vid_gen_params_t *p, int steps, char *dst, float cfg_scale, int
if (frames[i].data) free(frames[i].data); if (frames[i].data) free(frames[i].data);
} }
free(frames); free(frames);
if (audio) free_sd_audio(audio);
if (init_buf) free(init_buf); if (init_buf) free(init_buf);
if (end_buf) free(end_buf); if (end_buf) free(end_buf);

View File

@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
# whisper.cpp version # whisper.cpp version
WHISPER_REPO?=https://github.com/ggml-org/whisper.cpp WHISPER_REPO?=https://github.com/ggml-org/whisper.cpp
WHISPER_CPP_VERSION?=c33c5618b72bb345df029b730b36bc0e369845a3 WHISPER_CPP_VERSION?=0ccd896f5b882628e1c077f9769735ef4ce52860
SO_TARGET?=libgowhisper.so SO_TARGET?=libgowhisper.so
CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF

View File

@@ -847,6 +847,35 @@
nvidia-l4t-cuda-12: "nvidia-l4t-vibevoice" nvidia-l4t-cuda-12: "nvidia-l4t-vibevoice"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-vibevoice" nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-vibevoice"
icon: https://avatars.githubusercontent.com/u/6154722?s=200&v=4 icon: https://avatars.githubusercontent.com/u/6154722?s=200&v=4
- &liquid-audio
urls:
- https://github.com/Liquid4All/liquid-audio
- https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B
description: |
LiquidAI LFM2 / LFM2.5 Audio Python backend. End-to-end speech-to-speech, ASR,
TTS (4 baked voices), and text chat from a single 1.5B model. Wraps the
upstream `liquid-audio` package; supports fine-tuning via LocalAI's
/v1/fine-tuning/jobs endpoint.
tags:
- speech-to-speech
- any-to-any
- text-to-speech
- speech-to-text
- TTS
- ASR
- realtime
license: LFM-Open-License-v1.0
name: "liquid-audio"
alias: "liquid-audio"
capabilities:
nvidia: "cuda12-liquid-audio"
intel: "intel-liquid-audio"
amd: "rocm-liquid-audio"
default: "cpu-liquid-audio"
nvidia-cuda-13: "cuda13-liquid-audio"
nvidia-cuda-12: "cuda12-liquid-audio"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-liquid-audio"
icon: https://cdn-avatars.huggingface.co/v1/production/uploads/61b8e2ba285851687028d395/7_6D7rWrLxp2hb6OHSV1p.png
- &qwen-tts - &qwen-tts
urls: urls:
- https://github.com/QwenLM/Qwen3-TTS - https://github.com/QwenLM/Qwen3-TTS
@@ -3437,6 +3466,77 @@
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-vibevoice" uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-vibevoice"
mirrors: mirrors:
- localai/localai-backends:master-metal-darwin-arm64-vibevoice - localai/localai-backends:master-metal-darwin-arm64-vibevoice
## liquid-audio
- !!merge <<: *liquid-audio
name: "liquid-audio-development"
capabilities:
nvidia: "cuda12-liquid-audio-development"
intel: "intel-liquid-audio-development"
amd: "rocm-liquid-audio-development"
default: "cpu-liquid-audio-development"
nvidia-cuda-13: "cuda13-liquid-audio-development"
nvidia-cuda-12: "cuda12-liquid-audio-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-liquid-audio-development"
- !!merge <<: *liquid-audio
name: "cpu-liquid-audio"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-liquid-audio"
mirrors:
- localai/localai-backends:latest-cpu-liquid-audio
- !!merge <<: *liquid-audio
name: "cpu-liquid-audio-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-liquid-audio"
mirrors:
- localai/localai-backends:master-cpu-liquid-audio
- !!merge <<: *liquid-audio
name: "cuda12-liquid-audio"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-liquid-audio"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-liquid-audio
- !!merge <<: *liquid-audio
name: "cuda12-liquid-audio-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-liquid-audio"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-liquid-audio
- !!merge <<: *liquid-audio
name: "cuda13-liquid-audio"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-liquid-audio"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-liquid-audio
- !!merge <<: *liquid-audio
name: "cuda13-liquid-audio-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-liquid-audio"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-liquid-audio
- !!merge <<: *liquid-audio
name: "intel-liquid-audio"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-liquid-audio"
mirrors:
- localai/localai-backends:latest-gpu-intel-liquid-audio
- !!merge <<: *liquid-audio
name: "intel-liquid-audio-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-liquid-audio"
mirrors:
- localai/localai-backends:master-gpu-intel-liquid-audio
- !!merge <<: *liquid-audio
name: "rocm-liquid-audio"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-liquid-audio"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-liquid-audio
- !!merge <<: *liquid-audio
name: "rocm-liquid-audio-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-liquid-audio"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-liquid-audio
- !!merge <<: *liquid-audio
name: "cuda13-nvidia-l4t-arm64-liquid-audio"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-liquid-audio"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-liquid-audio
- !!merge <<: *liquid-audio
name: "cuda13-nvidia-l4t-arm64-liquid-audio-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-liquid-audio"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-liquid-audio
## qwen-tts ## qwen-tts
- !!merge <<: *qwen-tts - !!merge <<: *qwen-tts
name: "qwen-tts-development" name: "qwen-tts-development"

View File

@@ -0,0 +1,23 @@
.PHONY: liquid-audio
liquid-audio:
bash install.sh
.PHONY: run
run: liquid-audio
@echo "Running liquid-audio..."
bash run.sh
@echo "liquid-audio run."
.PHONY: test
test: liquid-audio
@echo "Testing liquid-audio..."
bash test.sh
@echo "liquid-audio tested."
.PHONY: protogen-clean
protogen-clean:
$(RM) backend_pb2_grpc.py backend_pb2.py
.PHONY: clean
clean: protogen-clean
rm -rf venv __pycache__

View File

@@ -0,0 +1,871 @@
#!/usr/bin/env python3
"""
Liquid Audio backend for LocalAI.
Wraps LiquidAI's `liquid-audio` Python package (https://github.com/Liquid4All/liquid-audio).
The same model serves four roles, selected by the `mode` option at load time:
chat, asr, tts, s2s. Fine-tuning is exposed via StartFineTune.
"""
from concurrent import futures
import argparse
import json
import os
import queue
import signal
import sys
import threading
import time
import traceback
import uuid
import grpc
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'common'))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'common'))
from grpc_auth import get_auth_interceptors # noqa: E402
from python_utils import parse_options # noqa: E402
import backend_pb2 # noqa: E402
import backend_pb2_grpc # noqa: E402
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
MAX_WORKERS = int(os.environ.get('PYTHON_GRPC_MAX_WORKERS', '1'))
# Voice id → system-prompt suffix. The model only ships these four voices.
VOICE_PROMPTS = {
"us_male": "Perform TTS. Use the US male voice.",
"us_female": "Perform TTS. Use the US female voice.",
"uk_male": "Perform TTS. Use the UK male voice.",
"uk_female": "Perform TTS. Use the UK female voice.",
}
DEFAULT_VOICE = "us_female"
# Special-token IDs that LFM2-Audio emits to delimit modality boundaries.
# Sourced from liquid_audio/model/lfm2_audio.py (see generate_sequential/_sample_*).
TEXT_END_TOKEN = 130 # <|text_end|>
AUDIO_START_TOKEN = 128 # <|audio_start|>
IM_END_TOKEN = 7 # <|im_end|>
AUDIO_EOS_CODE = 2048 # signals end-of-audio in any codebook position
_PATCHED_LOCAL_PATHS = False
def _patch_liquid_audio_local_paths():
"""Make liquid_audio.utils.get_model_dir() tolerate local directories.
Upstream always passes its argument to huggingface_hub.snapshot_download,
which only accepts `owner/repo` ids. LocalAI's gallery hands us absolute
paths under <ModelPath>/<owner>/<repo>, so we intercept snapshot_download
in the liquid_audio.utils namespace and return the directory as-is when
it already exists on disk. Idempotent.
"""
global _PATCHED_LOCAL_PATHS
if _PATCHED_LOCAL_PATHS:
return
import liquid_audio.utils as _la_utils
_orig_snapshot_download = _la_utils.snapshot_download
def _local_first_snapshot_download(repo_id, revision=None, **kwargs):
if isinstance(repo_id, (str, os.PathLike)) and os.path.isdir(str(repo_id)):
return str(repo_id)
return _orig_snapshot_download(repo_id, revision=revision, **kwargs)
_la_utils.snapshot_download = _local_first_snapshot_download
_PATCHED_LOCAL_PATHS = True
def _select_device():
import torch
if torch.cuda.is_available():
return "cuda"
if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
return "mps"
return "cpu"
class ActiveJob:
"""Tracks an in-flight fine-tune so FineTuneProgress can stream from its queue."""
def __init__(self, job_id):
self.job_id = job_id
self.progress_queue = queue.Queue()
self.thread = None
self.stopped = False
self.completed = False
self.error = None
class BackendServicer(backend_pb2_grpc.BackendServicer):
def __init__(self):
self.processor = None
self.model = None
self.device = "cpu"
self.dtype = None
self.options = {}
self.model_id = None
self.active_job = None
@property
def mode(self):
return str(self.options.get("mode", "chat")).lower()
@property
def voice(self):
v = str(self.options.get("voice", DEFAULT_VOICE)).lower()
return v if v in VOICE_PROMPTS else DEFAULT_VOICE
def Free(self, request, context):
# Called by LocalAI when unloading the model. Drop GPU tensors so the
# next load starts from a clean state instead of bumping into OOM.
try:
for attr in ("model", "processor", "tokenizer"):
if hasattr(self, attr):
try:
delattr(self, attr)
except Exception:
pass
import gc
gc.collect()
try:
import torch
if torch.cuda.is_available():
torch.cuda.empty_cache()
except Exception:
pass
return backend_pb2.Result(success=True, message="OK")
except Exception as exc:
print(f"Free failed: {exc}", file=sys.stderr)
return backend_pb2.Result(success=False, message=str(exc))
def Health(self, request, context):
return backend_pb2.Reply(message=bytes("OK", 'utf-8'))
def LoadModel(self, request, context):
try:
import torch
self.options = parse_options(request.Options)
if self.options.get("voice") and self.options["voice"] not in VOICE_PROMPTS:
print(f"Warning: unknown voice '{self.options['voice']}'; defaulting to '{DEFAULT_VOICE}'",
file=sys.stderr)
requested_device = self.options.get("device")
self.device = requested_device or _select_device()
if self.device == "cuda" and not torch.cuda.is_available():
return backend_pb2.Result(success=False, message="CUDA requested but not available")
if self.device == "mps" and not (hasattr(torch.backends, "mps") and
torch.backends.mps.is_available()):
print("MPS not available; falling back to CPU", file=sys.stderr)
self.device = "cpu"
dtype_name = str(self.options.get("dtype", "bfloat16")).lower()
self.dtype = {
"bfloat16": torch.bfloat16,
"bf16": torch.bfloat16,
"float16": torch.float16,
"fp16": torch.float16,
"half": torch.float16,
"float32": torch.float32,
"fp32": torch.float32,
}.get(dtype_name, torch.bfloat16)
# request.Model holds the raw `parameters.model` value (an HF
# repo id like "LiquidAI/LFM2.5-Audio-1.5B"); request.ModelFile
# is LocalAI's ModelPath-prefixed local copy that exists only
# when the gallery supplied a `files:` list. Mirror the
# transformers/vibevoice convention: prefer the repo id and
# only switch to the local path if it's been staged on disk.
model_id = request.Model
if not model_id:
model_id = request.ModelFile
if not model_id:
return backend_pb2.Result(success=False, message="No model identifier provided")
if request.ModelFile and os.path.isdir(request.ModelFile):
model_id = request.ModelFile
self.model_id = model_id
# Pure fine-tune jobs don't need an in-memory inference model — the
# Trainer instantiates its own copy at StartFineTune time.
if self.mode == "finetune":
print(f"Loaded liquid-audio backend in fine-tune mode (model id: {model_id})",
file=sys.stderr)
return backend_pb2.Result(success=True, message="OK")
from liquid_audio import LFM2AudioModel, LFM2AudioProcessor
# liquid_audio's from_pretrained unconditionally routes through
# huggingface_hub.snapshot_download, which rejects local paths
# (HFValidationError on `/models/LiquidAI/LFM2.5-Audio-1.5B`).
# When LocalAI's gallery has already staged the weights on disk,
# short-circuit the download to return the local directory.
_patch_liquid_audio_local_paths()
print(f"Loading liquid-audio model '{model_id}' on {self.device} ({self.dtype})",
file=sys.stderr)
self.processor = LFM2AudioProcessor.from_pretrained(model_id, device=self.device).eval()
self.model = LFM2AudioModel.from_pretrained(
model_id, device=self.device, dtype=self.dtype
).eval()
print(f"Liquid-audio mode={self.mode}, voice={self.voice}", file=sys.stderr)
return backend_pb2.Result(success=True, message="OK")
except Exception as exc:
print(f"LoadModel failed: {exc}", file=sys.stderr)
print(traceback.format_exc(), file=sys.stderr)
return backend_pb2.Result(success=False, message=str(exc))
def Predict(self, request, context):
try:
text = "".join(self._generate_text_stream(request))
return backend_pb2.Reply(message=text.encode("utf-8"))
except Exception as exc:
print(f"Predict failed: {exc}", file=sys.stderr)
print(traceback.format_exc(), file=sys.stderr)
context.set_code(grpc.StatusCode.INTERNAL)
context.set_details(str(exc))
return backend_pb2.Reply()
def PredictStream(self, request, context):
try:
for delta in self._generate_text_stream(request):
yield backend_pb2.Reply(message=delta.encode("utf-8"))
except Exception as exc:
print(f"PredictStream failed: {exc}", file=sys.stderr)
print(traceback.format_exc(), file=sys.stderr)
context.set_code(grpc.StatusCode.INTERNAL)
context.set_details(str(exc))
def VAD(self, request, context):
# Stub voice-activity detector: RMS-energy threshold over 30ms frames at
# 16 kHz. Good enough for the realtime endpoint's handleVAD loop, which
# only inspects segment presence + last segment end. The proper signal
# would come from the model's audio encoder, but that ride-along is a
# PR-D scope item — until then this keeps the legacy pipeline path
# working without forcing the operator to install a separate VAD model.
import numpy as np
try:
audio = np.asarray(request.audio, dtype=np.float32)
if audio.size == 0:
return backend_pb2.VADResponse(segments=[])
sample_rate = 16000
frame_size = sample_rate * 30 // 1000 # 30ms → 480 samples
threshold = float(self.options.get("vad_rms_threshold", 0.01))
min_speech_frames = int(self.options.get("vad_min_speech_frames", 2)) # ≥60ms
# handleVAD ticks every 300 ms and only inspects segment presence
# + last segment end relative to silence_threshold (~500 ms). Cap
# the analysed window to the tail of the buffer so we don't redo
# the entire growing utterance every tick.
window_s = float(self.options.get("vad_window_s", 5.0))
window_samples = int(window_s * sample_rate)
time_offset_s = 0.0
if audio.size > window_samples:
time_offset_s = (audio.size - window_samples) / sample_rate
audio = audio[-window_samples:]
n_frames = audio.size // frame_size
if n_frames == 0:
return backend_pb2.VADResponse(segments=[])
frames = audio[: n_frames * frame_size].reshape(n_frames, frame_size)
rms = np.sqrt(np.mean(frames ** 2, axis=1))
speech = rms > threshold
def _emit(start_idx, end_idx, out):
if end_idx - start_idx >= min_speech_frames:
out.append(backend_pb2.VADSegment(
start=time_offset_s + start_idx * frame_size / sample_rate,
end=time_offset_s + end_idx * frame_size / sample_rate,
))
segments = []
start_idx = None
for i, is_speech in enumerate(speech):
if is_speech and start_idx is None:
start_idx = i
elif not is_speech and start_idx is not None:
_emit(start_idx, i, segments)
start_idx = None
if start_idx is not None:
_emit(start_idx, n_frames, segments)
return backend_pb2.VADResponse(segments=segments)
except Exception as exc:
print(f"VAD failed: {exc}", file=sys.stderr)
print(traceback.format_exc(), file=sys.stderr)
context.set_code(grpc.StatusCode.INTERNAL)
context.set_details(str(exc))
return backend_pb2.VADResponse(segments=[])
def TTS(self, request, context):
try:
if self.model is None or self.processor is None:
return backend_pb2.Result(success=False, message="Model not loaded")
import torch
import torchaudio
from liquid_audio import ChatState
voice = request.voice.lower() if request.voice else self.voice
voice = voice.removeprefix("lfm2:").removeprefix("lfm:")
if voice not in VOICE_PROMPTS:
voice = self.voice
system_prompt = VOICE_PROMPTS[voice]
chat = ChatState(self.processor)
chat.new_turn("system")
chat.add_text(system_prompt)
chat.end_turn()
chat.new_turn("user")
chat.add_text(request.text or "")
chat.end_turn()
chat.new_turn("assistant")
audio_top_k = int(self.options.get("audio_top_k", 64))
audio_temp = float(self.options.get("audio_temperature", 0.8))
max_new = int(self.options.get("max_new_tokens", 2048))
audio_out = []
for tok in self.model.generate_sequential(
**chat,
max_new_tokens=max_new,
audio_temperature=audio_temp,
audio_top_k=audio_top_k,
):
if tok.numel() > 1:
audio_out.append(tok)
if len(audio_out) <= 1:
return backend_pb2.Result(success=False, message="No audio frames generated")
# Drop the trailing end-of-audio frame, matching the package's examples.
audio_codes = torch.stack(audio_out[:-1], 1).unsqueeze(0)
waveform = self.processor.decode(audio_codes)
out_path = request.dst
if not out_path:
return backend_pb2.Result(success=False, message="dst path is required")
os.makedirs(os.path.dirname(out_path) or ".", exist_ok=True)
# soundfile in preference to torchaudio.save — the latter routes
# through torchcodec, whose native libs need NVIDIA NPP that we
# don't bundle in the cuda13 image.
import soundfile as _sf
_sf.write(out_path, waveform.cpu().numpy().squeeze(0).T, 24_000)
return backend_pb2.Result(success=True)
except Exception as exc:
print(f"TTS failed: {exc}", file=sys.stderr)
print(traceback.format_exc(), file=sys.stderr)
return backend_pb2.Result(success=False, message=str(exc))
def AudioToAudioStream(self, request_iterator, context):
"""Bidirectional any-to-any speech-to-speech stream.
See `backend.proto` AudioToAudioStream for the wire protocol. Audio
is decoded once per turn here; chunked detokenization for sub-second
TTFB is left to a future iteration once the LFM2AudioDetokenizer
gains a streaming entry point.
"""
try:
yield from self._audio_to_audio_stream(request_iterator, context)
except Exception as exc:
print(f"AudioToAudioStream failed: {exc}", file=sys.stderr)
print(traceback.format_exc(), file=sys.stderr)
yield backend_pb2.AudioToAudioResponse(
event="error",
meta=json.dumps({"message": str(exc)}).encode("utf-8"),
)
def _audio_to_audio_stream(self, request_iterator, context):
if self.model is None or self.processor is None:
raise RuntimeError("Model not loaded")
import torch
import torchaudio
from liquid_audio import ChatState
cfg = None
chat = None
input_sample_rate = 16000
output_sample_rate = 24000
sequence = 0
def _new_event(event, **kwargs):
nonlocal sequence
sequence += 1
kwargs.setdefault("sequence", sequence)
return backend_pb2.AudioToAudioResponse(event=event, **kwargs)
def _ensure_chat():
"""Build a fresh ChatState seeded with the system prompt."""
nonlocal chat
chat = ChatState(self.processor)
system_prompt = (cfg.system_prompt if cfg and cfg.system_prompt
else "Respond with interleaved text and audio.")
chat.new_turn("system")
chat.add_text(system_prompt)
chat.end_turn()
# Buffers for the in-flight user turn
pcm_buffer = bytearray()
def _consume_user_turn():
nonlocal pcm_buffer
if not pcm_buffer:
return
# Avoid the bytes(pcm_buffer) copy and let the float widen happen
# in-place: numpy view → torch view → in-place divide.
import numpy as np
arr = np.frombuffer(memoryview(pcm_buffer), dtype=np.int16)
wav = torch.from_numpy(arr).to(torch.float32).div_(32768.0).unsqueeze(0)
chat.new_turn("user")
chat.add_audio(wav, input_sample_rate)
chat.end_turn()
pcm_buffer = bytearray()
def _run_generation():
"""Run generate_interleaved; yield response events as we go."""
chat.new_turn("assistant")
audio_top_k = int(self.options.get("audio_top_k", 4))
audio_temp = float(self.options.get("audio_temperature", 1.0))
text_top_k = int(self.options.get("text_top_k", 0)) or None
text_temp = float(self.options.get("text_temperature", 0)) or None
max_new = int(self.options.get("max_new_tokens", 512))
audio_tokens = []
for tok in self.model.generate_interleaved(
**chat,
max_new_tokens=max_new,
text_temperature=text_temp,
text_top_k=text_top_k,
audio_temperature=audio_temp,
audio_top_k=audio_top_k,
):
if tok.numel() == 1:
if tok.item() == IM_END_TOKEN:
break
text = self.processor.text.decode(tok)
if not text:
continue
yield _new_event(
"response.audio_transcript.delta",
meta=json.dumps({"delta": text}).encode("utf-8"),
)
else:
audio_tokens.append(tok)
# Detokenize the accumulated audio at end-of-turn — the
# LFM2AudioDetokenizer is non-streaming today.
if len(audio_tokens) > 1:
audio_codes = torch.stack(audio_tokens[:-1], 1).unsqueeze(0)
waveform = self.processor.decode(audio_codes)
# Convert to s16le PCM bytes at output_sample_rate
if output_sample_rate != 24000:
waveform = torchaudio.functional.resample(
waveform.cpu(), 24000, output_sample_rate
)
pcm = (waveform.cpu().squeeze(0).clamp(-1, 1) * 32767.0).to(
torch.int16
).numpy().tobytes()
yield _new_event(
"response.audio.delta",
pcm=pcm,
sample_rate=output_sample_rate,
)
yield _new_event("response.done", meta=b"{}")
for req in request_iterator:
if not context.is_active():
return
payload = req.WhichOneof("payload")
if payload == "config":
cfg = req.config
if cfg.input_sample_rate > 0:
input_sample_rate = cfg.input_sample_rate
if cfg.output_sample_rate > 0:
output_sample_rate = cfg.output_sample_rate
# The first config implicitly resets state.
_ensure_chat()
pcm_buffer = bytearray()
elif payload == "frame":
if chat is None:
_ensure_chat()
if req.frame.pcm:
pcm_buffer.extend(req.frame.pcm)
if req.frame.end_of_input:
_consume_user_turn()
yield from _run_generation()
elif payload == "control":
event = req.control.event
if event == "input_audio_buffer.commit":
_consume_user_turn()
yield from _run_generation()
elif event == "response.cancel":
# Synchronous generation here means cancel can only
# take effect between turns; we ack so the client unblocks.
yield _new_event("response.done", meta=b'{"cancelled":true}')
elif event == "session.update":
# Free-form session re-config; treat as a soft reset.
_ensure_chat()
pcm_buffer = bytearray()
# Unknown events are ignored — forward-compatible.
def AudioTranscription(self, request, context):
try:
if self.model is None or self.processor is None:
return backend_pb2.TranscriptResult(segments=[], text="")
import torchaudio
from liquid_audio import ChatState
audio_path = request.dst
if not audio_path:
return backend_pb2.TranscriptResult(segments=[], text="")
chat = ChatState(self.processor)
chat.new_turn("system")
chat.add_text("Perform ASR.")
chat.end_turn()
chat.new_turn("user")
# soundfile in preference to torchaudio.load — the latter routes
# through torchcodec which needs NVIDIA NPP libs we don't bundle.
import soundfile as _sf
import torch
audio_np, sr = _sf.read(audio_path, dtype="float32", always_2d=True)
wav = torch.from_numpy(audio_np.T) # (channels, samples)
if wav.shape[0] > 1:
# Down-mix to mono — the processor expects a single channel
wav = wav.mean(dim=0, keepdim=True)
chat.add_audio(wav, sr)
chat.end_turn()
chat.new_turn("assistant")
max_new = int(self.options.get("max_new_tokens", 1024))
pieces = []
for tok in self.model.generate_sequential(**chat, max_new_tokens=max_new):
if tok.numel() == 1:
if tok.item() == IM_END_TOKEN:
break
pieces.append(self.processor.text.decode(tok))
text = "".join(pieces).strip()
duration_ms = int((wav.shape[1] / sr) * 1000)
segment = backend_pb2.TranscriptSegment(
id=0, start=0, end=duration_ms, text=text, tokens=[],
)
return backend_pb2.TranscriptResult(segments=[segment], text=text)
except Exception as exc:
print(f"AudioTranscription failed: {exc}", file=sys.stderr)
print(traceback.format_exc(), file=sys.stderr)
return backend_pb2.TranscriptResult(segments=[], text="")
def StartFineTune(self, request, context):
if self.active_job is not None and not self.active_job.completed:
return backend_pb2.FineTuneJobResult(
job_id="", success=False,
message="A fine-tuning job is already running",
)
job_id = request.job_id or str(uuid.uuid4())
job = ActiveJob(job_id)
self.active_job = job
thread = threading.Thread(target=self._run_training, args=(request, job), daemon=True)
job.thread = thread
thread.start()
return backend_pb2.FineTuneJobResult(
job_id=job_id, success=True, message="Training started",
)
def FineTuneProgress(self, request, context):
if self.active_job is None or self.active_job.job_id != request.job_id:
context.set_code(grpc.StatusCode.NOT_FOUND)
context.set_details(f"Job {request.job_id} not found")
return
job = self.active_job
while True:
try:
update = job.progress_queue.get(timeout=1.0)
except queue.Empty:
if job.completed or job.stopped:
break
if not context.is_active():
break
continue
if update is None:
break
yield update
if update.status in ("completed", "failed", "stopped"):
break
def StopFineTune(self, request, context):
# We can't kill the Accelerate training loop mid-step cleanly from here;
# LocalAI's job manager kills the backend process on stop. The flag below
# at least lets the progress stream terminate quickly.
if self.active_job is not None and self.active_job.job_id == request.job_id:
self.active_job.stopped = True
self.active_job.progress_queue.put(None)
return backend_pb2.Result(success=True, message="OK")
def _run_training(self, request, job):
try:
self._do_train(request, job)
job.completed = True
job.progress_queue.put(backend_pb2.FineTuneProgressUpdate(
job_id=job.job_id, status="completed", message="Training completed",
progress_percent=100.0,
))
except Exception as exc:
job.error = str(exc)
job.completed = True
print(f"Training failed: {exc}", file=sys.stderr)
print(traceback.format_exc(), file=sys.stderr)
job.progress_queue.put(backend_pb2.FineTuneProgressUpdate(
job_id=job.job_id, status="failed", message=str(exc),
))
finally:
job.progress_queue.put(None)
def _do_train(self, request, job):
from liquid_audio import LFM2AudioModel # noqa: F401 (sanity import)
from liquid_audio.data.dataloader import LFM2DataLoader
from liquid_audio.trainer import Trainer
model_id = request.model or self.model_id or "LiquidAI/LFM2.5-Audio-1.5B"
dataset_path = request.dataset_source
if not dataset_path:
raise ValueError("dataset_source is required (path to a preprocessed dataset)")
extras = dict(request.extra_options) if request.extra_options else {}
val_path = extras.get("val_dataset")
# Map FineTuneRequest hyperparameters to liquid_audio.Trainer constructor args
lr = request.learning_rate or 3e-5
max_steps = request.max_steps or 1000
warmup_steps = request.warmup_steps or min(100, max_steps // 10)
batch_size = request.batch_size or 16
save_interval = request.save_steps or max(1, max_steps // 4)
output_dir = request.output_dir or os.path.join(
os.environ.get("LIQUID_AUDIO_OUTPUT_DIR", "/tmp"),
f"liquid-audio-{job.job_id}",
)
os.makedirs(output_dir, exist_ok=True)
job.progress_queue.put(backend_pb2.FineTuneProgressUpdate(
job_id=job.job_id, status="loading_dataset",
message=f"Loading preprocessed dataset from {dataset_path}",
))
train_data = LFM2DataLoader(dataset_path)
val_data = LFM2DataLoader(val_path) if val_path else None
job.progress_queue.put(backend_pb2.FineTuneProgressUpdate(
job_id=job.job_id, status="loading_model",
message=f"Loading base model {model_id}",
))
# The Liquid Trainer logs via self.accelerator.print; we subclass it to
# also push progress events onto the queue every logging_interval steps.
progress_q = job.progress_queue
class QueuedTrainer(Trainer):
def log(self_, model_output):
if self_.step > 0 and self_.step % self_.logging_interval == 0:
try:
loss = self_.accelerator.reduce(
model_output.loss.detach(), reduction="mean"
).item()
except Exception:
loss = float("nan")
lr_now = self_.optimizer.param_groups[0]["lr"]
pct = (self_.step / self_.max_steps * 100.0) if self_.max_steps else 0.0
progress_q.put(backend_pb2.FineTuneProgressUpdate(
job_id=job.job_id,
current_step=int(self_.step),
total_steps=int(self_.max_steps),
current_epoch=float(self_.epoch),
loss=float(loss),
learning_rate=float(lr_now),
progress_percent=float(pct),
status="training",
))
# Honour stop requests: raising here terminates the loop cleanly
if job.stopped:
raise KeyboardInterrupt("stop requested")
return super().log(model_output)
def validate(self_):
progress_q.put(backend_pb2.FineTuneProgressUpdate(
job_id=job.job_id, current_step=int(self_.step),
total_steps=int(self_.max_steps), status="training",
message=f"Running validation at step {self_.step}",
))
return super().validate()
trainer = QueuedTrainer(
model_id=model_id,
train_data=train_data,
val_data=val_data,
lr=lr,
max_steps=max_steps,
warmup_steps=warmup_steps,
batch_size=batch_size,
save_interval=save_interval,
output_dir=output_dir,
weight_decay=request.weight_decay or 0.1,
)
job.progress_queue.put(backend_pb2.FineTuneProgressUpdate(
job_id=job.job_id, status="training", message="Training started",
total_steps=int(max_steps),
))
trainer.train()
job.progress_queue.put(backend_pb2.FineTuneProgressUpdate(
job_id=job.job_id, status="saving",
message=f"Saved final model to {output_dir}",
checkpoint_path=os.path.join(output_dir, "final"),
))
def _build_chat_state(self, messages, user_prompt, tools_prelude=None):
"""Build a ChatState from a list of (role, content) tuples plus an optional final user turn.
tools_prelude, when non-empty, is prepended as an extra system turn carrying
the LFM2 tool-list block — mirrors gallery/lfm.yaml's `function:` template
so the model sees the same prompt shape whether served via llama-cpp or here.
"""
from liquid_audio import ChatState
chat = ChatState(self.processor)
if tools_prelude:
chat.new_turn("system")
chat.add_text(tools_prelude)
chat.end_turn()
for role, content in messages:
chat.new_turn(role)
chat.add_text(content)
chat.end_turn()
if user_prompt:
chat.new_turn("user")
chat.add_text(user_prompt)
chat.end_turn()
chat.new_turn("assistant")
return chat
def _collect_messages(self, request):
"""Translate PredictOptions.Messages into (role, content) tuples."""
out = []
for m in request.Messages:
role = (m.role or "user").lower()
if role not in ("system", "user", "assistant"):
role = "user"
out.append((role, m.content or ""))
return out
def _render_tools_prelude(self, request):
"""Build the LFM2 `<|tool_list_start|>…<|tool_list_end|>` system prelude
from request.Tools (OpenAI Chat-Completions tool JSON). Returns "" when
no tools are attached. Output mirrors gallery/lfm.yaml's `function:`
template so the model sees the same prompt whether routed via llama-cpp
or this backend."""
tools_raw = getattr(request, "Tools", "") or ""
if not tools_raw:
return ""
try:
tools = json.loads(tools_raw)
except json.JSONDecodeError:
print(f"liquid-audio: ignoring malformed Tools JSON: {tools_raw[:200]!r}",
file=sys.stderr)
return ""
if not isinstance(tools, list) or not tools:
return ""
# The LFM2 chat template uses single-quoted Python-dict-ish syntax in
# examples, but the tokenizer treats this whole block as opaque text;
# JSON works fine and is what other backends emit.
return (
"You are a function calling AI model. You are provided with functions to "
"execute. You may call one or more functions to assist with the user query. "
"Don't make assumptions about what values to plug into functions.\n"
"List of tools: <|tool_list_start|>"
+ json.dumps(tools, separators=(",", ":"))
+ "<|tool_list_end|>"
)
def _generate_text_stream(self, request):
"""Yield text-only deltas from generate_sequential. Caller joins for unary Predict."""
if self.model is None or self.processor is None:
raise RuntimeError("Model not loaded")
messages = self._collect_messages(request)
user_prompt = request.Prompt or None
tools_prelude = self._render_tools_prelude(request)
# If the request already carries Messages, Prompt is the templated form
# of the same content — don't append a duplicate user turn.
chat = self._build_chat_state(
messages,
user_prompt if not messages else None,
tools_prelude=tools_prelude,
)
max_new = request.Tokens if request.Tokens > 0 else int(self.options.get("max_new_tokens", 512))
temperature = request.Temperature if request.Temperature > 0 else None
top_k = request.TopK if request.TopK > 0 else None
for tok in self.model.generate_sequential(
**chat,
max_new_tokens=max_new,
text_temperature=temperature,
text_top_k=top_k,
):
if tok.numel() == 1:
if tok.item() == IM_END_TOKEN:
break
yield self.processor.text.decode(tok)
def serve(address):
server = grpc.server(
futures.ThreadPoolExecutor(max_workers=MAX_WORKERS),
options=[
('grpc.max_message_length', 50 * 1024 * 1024),
('grpc.max_send_message_length', 50 * 1024 * 1024),
('grpc.max_receive_message_length', 50 * 1024 * 1024),
],
interceptors=get_auth_interceptors(),
)
backend_pb2_grpc.add_BackendServicer_to_server(BackendServicer(), server)
server.add_insecure_port(address)
server.start()
print(f"Liquid-audio backend listening on {address}", file=sys.stderr, flush=True)
def stop(_signum, _frame):
server.stop(0)
sys.exit(0)
signal.signal(signal.SIGTERM, stop)
signal.signal(signal.SIGINT, stop)
try:
while True:
time.sleep(_ONE_DAY_IN_SECONDS)
except KeyboardInterrupt:
server.stop(0)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Liquid Audio gRPC backend")
parser.add_argument("--addr", default="localhost:50051", help="gRPC server address")
args = parser.parse_args()
serve(args.addr)

View File

@@ -0,0 +1,18 @@
#!/bin/bash
set -e
# liquid-audio requires Python ≥ 3.12 (per its pyproject.toml); the default
# portable Python in libbackend.sh is 3.10. Override before sourcing.
export PYTHON_VERSION="${PYTHON_VERSION:-3.12}"
export PYTHON_PATCH="${PYTHON_PATCH:-11}"
backend_dir=$(dirname $0)
if [ -d $backend_dir/common ]; then
source $backend_dir/common/libbackend.sh
else
source $backend_dir/../common/libbackend.sh
fi
# liquid-audio's torch wheels are large; allow upgrades to satisfy transitive pins
EXTRA_PIP_INSTALL_FLAGS+=" --upgrade --index-strategy=unsafe-first-match"
installRequirements

View File

@@ -0,0 +1,11 @@
#!/bin/bash
set -e
backend_dir=$(dirname $0)
if [ -d $backend_dir/common ]; then
source $backend_dir/common/libbackend.sh
else
source $backend_dir/../common/libbackend.sh
fi
runProtogen

View File

@@ -0,0 +1,13 @@
--extra-index-url https://download.pytorch.org/whl/cpu
torch>=2.8.0
torchaudio>=2.8.0
torchcodec>=0.9.1
transformers>=4.55.4
accelerate>=1.10.1
datasets>=4.8.4
einops>=0.8.1
librosa>=0.11.0
soundfile>=0.12.1
sentencepiece>=0.2.1
huggingface-hub>=1.3.0
liquid-audio>=1.2.0

View File

@@ -0,0 +1,13 @@
--extra-index-url https://download.pytorch.org/whl/cu121
torch>=2.8.0
torchaudio>=2.8.0
torchcodec>=0.9.1
transformers>=4.55.4
accelerate>=1.10.1
datasets>=4.8.4
einops>=0.8.1
librosa>=0.11.0
soundfile>=0.12.1
sentencepiece>=0.2.1
huggingface-hub>=1.3.0
liquid-audio>=1.2.0

View File

@@ -0,0 +1,13 @@
--extra-index-url https://download.pytorch.org/whl/cu130
torch>=2.8.0
torchaudio>=2.8.0
torchcodec>=0.9.1
transformers>=4.55.4
accelerate>=1.10.1
datasets>=4.8.4
einops>=0.8.1
librosa>=0.11.0
soundfile>=0.12.1
sentencepiece>=0.2.1
huggingface-hub>=1.3.0
liquid-audio>=1.2.0

View File

@@ -0,0 +1,13 @@
--extra-index-url https://download.pytorch.org/whl/rocm7.0
torch>=2.8.0
torchaudio>=2.8.0
torchcodec>=0.9.1
transformers>=4.55.4
accelerate>=1.10.1
datasets>=4.8.4
einops>=0.8.1
librosa>=0.11.0
soundfile>=0.12.1
sentencepiece>=0.2.1
huggingface-hub>=1.3.0
liquid-audio>=1.2.0

View File

@@ -0,0 +1,13 @@
--extra-index-url https://pypi.jetson-ai-lab.io/jp7/cu130
torch>=2.8.0
torchaudio>=2.8.0
torchcodec>=0.9.1
transformers>=4.55.4
accelerate>=1.10.1
datasets>=4.8.4
einops>=0.8.1
librosa>=0.11.0
soundfile>=0.12.1
sentencepiece>=0.2.1
huggingface-hub>=1.3.0
liquid-audio>=1.2.0

View File

@@ -0,0 +1,12 @@
torch>=2.8.0
torchaudio>=2.8.0
torchcodec>=0.9.1
transformers>=4.55.4
accelerate>=1.10.1
datasets>=4.8.4
einops>=0.8.1
librosa>=0.11.0
soundfile>=0.12.1
sentencepiece>=0.2.1
huggingface-hub>=1.3.0
liquid-audio>=1.2.0

View File

@@ -0,0 +1,3 @@
grpcio==1.78.1
protobuf
certifi

View File

@@ -0,0 +1,10 @@
#!/bin/bash
backend_dir=$(dirname $0)
if [ -d $backend_dir/common ]; then
source $backend_dir/common/libbackend.sh
else
source $backend_dir/../common/libbackend.sh
fi
startBackend $@

View File

@@ -0,0 +1,89 @@
"""Smoke tests for the liquid-audio backend.
These run without contacting HuggingFace or loading model weights:
they only verify that the gRPC service starts and Health() responds.
To run an end-to-end inference test, set LIQUID_AUDIO_MODEL_ID
(e.g. "LiquidAI/LFM2.5-Audio-1.5B") in the environment — see test_inference().
"""
import os
import subprocess
import sys
import time
import unittest
import grpc
# Ensure generated protobuf stubs are importable
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
import backend_pb2
import backend_pb2_grpc
class TestBackend(unittest.TestCase):
@classmethod
def setUpClass(cls):
addr = os.environ.get("LIQUID_AUDIO_TEST_ADDR", "localhost:50053")
cls.addr = addr
cls.server = subprocess.Popen(
[sys.executable, os.path.join(os.path.dirname(__file__), "backend.py"), "--addr", addr],
)
time.sleep(2) # Give the server a moment to bind
@classmethod
def tearDownClass(cls):
cls.server.terminate()
try:
cls.server.wait(timeout=5)
except subprocess.TimeoutExpired:
cls.server.kill()
def _stub(self):
channel = grpc.insecure_channel(self.addr)
return backend_pb2_grpc.BackendStub(channel)
def test_health(self):
stub = self._stub()
reply = stub.Health(backend_pb2.HealthMessage(), timeout=5)
self.assertEqual(reply.message, b"OK")
def test_load_finetune_mode_without_weights(self):
"""Loading in fine-tune mode should succeed without pulling model weights."""
stub = self._stub()
result = stub.LoadModel(
backend_pb2.ModelOptions(
Model="LiquidAI/LFM2.5-Audio-1.5B",
Options=["mode:finetune"],
),
timeout=10,
)
self.assertTrue(result.success, msg=result.message)
@unittest.skipUnless(os.environ.get("LIQUID_AUDIO_MODEL_ID"),
"Set LIQUID_AUDIO_MODEL_ID to run an end-to-end inference smoke test")
def test_inference(self):
"""End-to-end: load a real LFM2-Audio model and run one short prediction."""
stub = self._stub()
model_id = os.environ["LIQUID_AUDIO_MODEL_ID"]
result = stub.LoadModel(
backend_pb2.ModelOptions(
Model=model_id,
Options=["mode:chat"],
),
timeout=600,
)
self.assertTrue(result.success, msg=result.message)
reply = stub.Predict(
backend_pb2.PredictOptions(
Prompt="Hello!",
Tokens=8,
Temperature=0.0,
),
timeout=120,
)
self.assertGreater(len(reply.message), 0)
if __name__ == "__main__":
unittest.main()

View File

@@ -0,0 +1,11 @@
#!/bin/bash
set -e
backend_dir=$(dirname $0)
if [ -d $backend_dir/common ]; then
source $backend_dir/common/libbackend.sh
else
source $backend_dir/../common/libbackend.sh
fi
runUnittests

View File

@@ -36,15 +36,11 @@ fi
# flash-attn-4 4.0 stable lands. # flash-attn-4 4.0 stable lands.
EXTRA_PIP_INSTALL_FLAGS+=" --prerelease=allow" EXTRA_PIP_INSTALL_FLAGS+=" --prerelease=allow"
# JetPack 7 / L4T arm64 wheels are built for cp312 and shipped via # JetPack 7 / L4T arm64 sglang + torch wheels come straight from PyPI now
# pypi.jetson-ai-lab.io. Bump the venv Python so the prebuilt sglang # (torch 2.11+ ships aarch64 + cu130 manylinux wheels and sglang 0.5.11+
# wheel resolves cleanly. The actual install on l4t13 goes through # ships a cp312 aarch64 wheel pinned to that torch). They're cp312-only,
# pyproject.toml (see the elif branch below) so [tool.uv.sources] can # so bump the venv Python accordingly.
# pin only torch/torchvision/torchaudio/sglang to the jetson-ai-lab # https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/
# index — leaving PyPI as the path for transitive deps like
# markdown-it-py / anthropic / propcache that the L4T mirror's proxy
# 503s on. No --index-strategy flag here: the explicit index keeps the
# scoping clean.
if [ "x${BUILD_PROFILE}" == "xl4t13" ]; then if [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
PYTHON_VERSION="3.12" PYTHON_VERSION="3.12"
PYTHON_PATCH="12" PYTHON_PATCH="12"
@@ -110,27 +106,6 @@ if [ "x${BUILD_TYPE}" == "x" ] || [ "x${FROM_SOURCE:-}" == "xtrue" ]; then
fi fi
uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} . uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} .
popd popd
# L4T arm64 (JetPack 7): drive the install through pyproject.toml so that
# [tool.uv.sources] can pin torch/torchvision/torchaudio/sglang to the
# jetson-ai-lab index, while everything else (transitive deps and
# PyPI-resolvable packages like transformers / accelerate) comes from
# PyPI. Bypasses installRequirements because uv pip install -r
# requirements.txt does not honor sources — see
# backend/python/sglang/pyproject.toml for the rationale. Mirrors the
# equivalent path in backend/python/vllm/install.sh.
elif [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
ensureVenv
if [ "x${PORTABLE_PYTHON}" == "xtrue" ]; then
export C_INCLUDE_PATH="${C_INCLUDE_PATH:-}:$(_portable_dir)/include/python${PYTHON_VERSION}"
fi
pushd "${backend_dir}"
# Build deps first (matches installRequirements' requirements-install.txt
# pass — sglang/sgl-kernel sdists need packaging/setuptools-scm in the
# venv before they can build under --no-build-isolation).
uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} -r requirements-install.txt
uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} --requirement pyproject.toml
popd
runProtogen
else else
installRequirements installRequirements
fi fi

View File

@@ -1,68 +0,0 @@
# L4T arm64 (JetPack 7 / sbsa cu130) install spec for the sglang backend.
#
# Why this file exists, and why only the l4t13 BUILD_PROFILE consumes it:
#
# pypi.jetson-ai-lab.io hosts the L4T-specific torch / sglang / sgl-kernel
# wheels we need on aarch64 + cuda13, but it ALSO transparently proxies the
# rest of PyPI through `/+f/<sha>/<filename>` URLs that 503 frequently.
# With `--extra-index-url` + `--index-strategy=unsafe-best-match` (the
# historical fix in install.sh) uv would pick those proxy URLs for ordinary
# PyPI packages — markdown-it-py, anthropic, propcache, etc. — and trip on
# the 503s. See e.g. CI run 25439791228 (markdown-it-py-4.0.0).
#
# `explicit = true` on the index makes uv consult the L4T mirror ONLY for
# packages mapped under [tool.uv.sources]. Everything else goes to PyPI.
# This breaks the historical 503 path without losing access to the L4T
# wheels we actually need from there. Mirrors the equivalent fix already
# in backend/python/vllm/pyproject.toml.
#
# `uv pip install -r requirements.txt` does NOT honor [tool.uv.sources]
# (sources are project-mode only, not pip-compat mode), so install.sh's
# l4t13 branch invokes `uv pip install --requirement pyproject.toml`
# directly. Other BUILD_PROFILEs continue to use the requirements-*.txt
# pipeline through libbackend.sh's installRequirements and never read
# this file.
[project]
name = "localai-sglang-l4t13"
version = "0.0.0"
requires-python = ">=3.12,<3.13"
dependencies = [
# Mirror of requirements.txt — kept in sync manually for now since the
# l4t13 path bypasses installRequirements (see install.sh).
"grpcio==1.80.0",
"protobuf",
"certifi",
"setuptools",
"pillow",
# L4T-specific accelerator stack (sourced from jetson-ai-lab below).
"torch",
"torchvision",
"torchaudio",
# sglang on jetson — the [all] extra is deliberately omitted because it
# pulls outlines/decord, and decord has no aarch64 cp312 wheel anywhere
# (PyPI nor the jetson-ai-lab index ships only legacy cp35-cp37). With
# [all] uv backtracks through versions trying to satisfy decord and
# lands on sglang==0.1.16. The 0.5.0 floor matches the only major
# series the jetson-ai-lab sbsa/cu130 mirror currently publishes
# (sglang==0.5.1.post2 as of 2026-05-06). Bumping to >=0.5.11 here
# would make the build unsatisfiable until the mirror catches up.
# Gemma 4 / MTP recipes are therefore not supported on l4t13 — those
# features land on cublas12/cublas13 hosts that pull the newer wheel
# from PyPI. backend.py keeps backward compat with the 0.5.x SamplingParams
# field rename via runtime detection.
"sglang>=0.5.0",
# PyPI-resolvable packages that complete the runtime.
"accelerate",
"transformers",
]
[[tool.uv.index]]
name = "jetson-ai-lab"
url = "https://pypi.jetson-ai-lab.io/sbsa/cu130"
explicit = true
[tool.uv.sources]
torch = { index = "jetson-ai-lab" }
torchvision = { index = "jetson-ai-lab" }
torchaudio = { index = "jetson-ai-lab" }
sglang = { index = "jetson-ai-lab" }

View File

@@ -0,0 +1,15 @@
# sglang 0.5.11+ ships an aarch64 manylinux wheel on PyPI whose Requires-Dist
# pins torch==2.11.0 / torchaudio==2.11.0, locking an ABI-consistent set with
# the cu130 torch wheel installed above. 0.5.11 is the floor for Gemma 4
# support (sgl-project/sglang#21952).
#
# The [all] extra is deliberately NOT used on aarch64: it pulls the
# [diffusion] sub-extra which requires `xatlas`, and xatlas ships no
# aarch64 wheel and its sdist depends on scikit_build_core without
# declaring it in build-system.requires — so under --no-build-isolation
# uv can't build it. Upstream sglang gates st_attn and vsa on
# platform_machine != aarch64 in the diffusion extra but forgot xatlas.
# Plain `sglang` carries everything backend.py uses (Engine, ServerArgs,
# FunctionCallParser, ReasoningParser); the [all] extras are optional
# accelerators not required at import time.
sglang>=0.5.11

View File

@@ -0,0 +1,9 @@
# JetPack 7 / L4T arm64 + CUDA 13. Since PyTorch 2.11 (April 2026), PyPI ships
# aarch64 + cu130 manylinux wheels for torch/torchvision/torchaudio directly,
# so we no longer need a custom --extra-index-url for the L4T mirror.
# https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/
accelerate
torch
torchvision
torchaudio
transformers

View File

@@ -2,9 +2,9 @@ torch==2.7.1
llvmlite==0.43.0 llvmlite==0.43.0
numba==0.60.0 numba==0.60.0
accelerate accelerate
transformers>=5.8.0 transformers>=5.8.1
bitsandbytes bitsandbytes
sentence-transformers==5.4.0 sentence-transformers==5.5.0
diffusers diffusers
soundfile soundfile
protobuf==6.33.5 protobuf==6.33.5

View File

@@ -2,9 +2,9 @@ torch==2.7.1
accelerate accelerate
llvmlite==0.43.0 llvmlite==0.43.0
numba==0.60.0 numba==0.60.0
transformers>=5.8.0 transformers>=5.8.1
bitsandbytes bitsandbytes
sentence-transformers==5.4.0 sentence-transformers==5.5.0
diffusers diffusers
soundfile soundfile
protobuf==6.33.5 protobuf==6.33.5

View File

@@ -2,9 +2,9 @@
torch==2.9.0 torch==2.9.0
llvmlite==0.43.0 llvmlite==0.43.0
numba==0.60.0 numba==0.60.0
transformers>=5.8.0 transformers>=5.8.1
bitsandbytes bitsandbytes
sentence-transformers==5.4.0 sentence-transformers==5.5.0
diffusers diffusers
soundfile soundfile
protobuf==6.33.5 protobuf==6.33.5

View File

@@ -1,11 +1,11 @@
--extra-index-url https://download.pytorch.org/whl/rocm7.0 --extra-index-url https://download.pytorch.org/whl/rocm7.0
torch==2.10.0+rocm7.0 torch==2.10.0+rocm7.0
accelerate accelerate
transformers>=5.8.0 transformers>=5.8.1
llvmlite==0.43.0 llvmlite==0.43.0
numba==0.60.0 numba==0.60.0
bitsandbytes bitsandbytes
sentence-transformers==5.4.0 sentence-transformers==5.5.0
diffusers diffusers
soundfile soundfile
protobuf==6.33.5 protobuf==6.33.5

View File

@@ -3,9 +3,9 @@ torch
optimum[openvino] optimum[openvino]
llvmlite==0.43.0 llvmlite==0.43.0
numba==0.60.0 numba==0.60.0
transformers>=5.8.0 transformers>=5.8.1
bitsandbytes bitsandbytes
sentence-transformers==5.4.0 sentence-transformers==5.5.0
diffusers diffusers
soundfile soundfile
protobuf==6.33.5 protobuf==6.33.5

View File

@@ -2,9 +2,9 @@ torch==2.7.1
llvmlite==0.43.0 llvmlite==0.43.0
numba==0.60.0 numba==0.60.0
accelerate accelerate
transformers>=5.8.0 transformers>=5.8.1
bitsandbytes bitsandbytes
sentence-transformers==5.4.0 sentence-transformers==5.5.0
diffusers diffusers
soundfile soundfile
protobuf==6.33.5 protobuf==6.33.5

View File

@@ -13,14 +13,14 @@ else
fi fi
# Handle l4t build profiles (Python 3.12, pip fallback) if needed. # Handle l4t build profiles (Python 3.12, pip fallback) if needed.
# unsafe-best-match is required on l4t13 because the jetson-ai-lab index # Since PyTorch 2.11 (April 2026) PyPI ships aarch64 + cu130 manylinux wheels
# lists transitive deps at limited versions — without it uv pins to the # directly for torch/torchvision/torchaudio and an aarch64 vllm wheel pinned
# first matching index and fails to resolve a compatible wheel from PyPI. # to that torch, so the jetson-ai-lab mirror is no longer needed.
# https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/
if [ "x${BUILD_PROFILE}" == "xl4t13" ]; then if [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
PYTHON_VERSION="3.12" PYTHON_VERSION="3.12"
PYTHON_PATCH="12" PYTHON_PATCH="12"
PY_STANDALONE_TAG="20251120" PY_STANDALONE_TAG="20251120"
EXTRA_PIP_INSTALL_FLAGS="${EXTRA_PIP_INSTALL_FLAGS:-} --index-strategy=unsafe-best-match"
fi fi
if [ "x${BUILD_PROFILE}" == "xl4t12" ]; then if [ "x${BUILD_PROFILE}" == "xl4t12" ]; then
@@ -42,18 +42,11 @@ if [ "x${BUILD_TYPE}" == "xhipblas" ]; then
else else
uv pip install vllm==0.14.0 --extra-index-url https://wheels.vllm.ai/rocm/0.14.0/rocm700 uv pip install vllm==0.14.0 --extra-index-url https://wheels.vllm.ai/rocm/0.14.0/rocm700
fi fi
elif [ "x${BUILD_PROFILE}" == "xl4t13" ]; then elif [ "x${BUILD_PROFILE}" == "xcublas13" ] || [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
# JetPack 7 / L4T arm64 cu130 — vllm comes from the prebuilt SBSA wheel # cublas13 (x86_64) and l4t13 (aarch64) both pull vllm from PyPI now:
# at jetson-ai-lab. Version is unpinned: the index ships whatever build # vllm 0.19+ defaults to cu130 wheels on x86_64 and vllm 0.20+ ships an
# matches the cu130/cp312 ABI. unsafe-best-match lets uv fall through # aarch64 manylinux wheel pinned to torch==2.11.0. No extra index needed
# to PyPI for transitive deps not present on the jetson-ai-lab index. # in either case.
if [ "x${USE_PIP}" == "xtrue" ]; then
pip install vllm --extra-index-url https://pypi.jetson-ai-lab.io/sbsa/cu130
else
uv pip install --index-strategy=unsafe-best-match vllm --extra-index-url https://pypi.jetson-ai-lab.io/sbsa/cu130
fi
elif [ "x${BUILD_PROFILE}" == "xcublas13" ]; then
# vllm 0.19+ defaults to cu130 wheels on PyPI, no extra index needed.
if [ "x${USE_PIP}" == "xtrue" ]; then if [ "x${USE_PIP}" == "xtrue" ]; then
pip install vllm --torch-backend=auto pip install vllm --torch-backend=auto
else else

View File

@@ -1,11 +1,15 @@
--extra-index-url https://pypi.jetson-ai-lab.io/sbsa/cu130 # JetPack 7 / L4T arm64 + CUDA 13. PyPI ships aarch64 + cu130 manylinux wheels
# for torch/torchvision/torchaudio directly since PyTorch 2.11 (April 2026),
# so no custom index is needed. flash-attn is dropped here: PyPI has no
# aarch64 wheel for it, but vLLM 0.20+ bundles its own vllm_flash_attn
# (fa2 + fa3) inside the main wheel, so it is not required at runtime.
# https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/
accelerate accelerate
torch torch
torchvision torchvision
torchaudio torchaudio
transformers transformers
bitsandbytes bitsandbytes
flash-attn
diffusers diffusers
librosa librosa
soundfile soundfile

View File

@@ -43,14 +43,11 @@ if [ "x${BUILD_PROFILE}" == "xcublas13" ]; then
EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match" EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi fi
# JetPack 7 / L4T arm64 wheels (torch, vllm, flash-attn) live on # JetPack 7 / L4T arm64 vllm + torch wheels come straight from PyPI now
# pypi.jetson-ai-lab.io and are built for cp312, so bump the venv Python # (torch 2.11+ ships aarch64 + cu130 manylinux wheels and vllm 0.20+ ships
# accordingly. JetPack 6 keeps cp310 + USE_PIP=true. # an aarch64 wheel pinned to that torch). They're cp312-only, so bump the
# # venv Python accordingly. JetPack 6 keeps cp310 + USE_PIP=true.
# l4t13 uses pyproject.toml (see the elif branch below) to pin only the # https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/
# L4T-specific wheels to the jetson-ai-lab index via [tool.uv.sources].
# That keeps PyPI as the resolution path for transitive deps like
# anthropic/openai/propcache, which the L4T mirror's proxy 503s on.
if [ "x${BUILD_PROFILE}" == "xl4t12" ]; then if [ "x${BUILD_PROFILE}" == "xl4t12" ]; then
USE_PIP=true USE_PIP=true
fi fi
@@ -103,25 +100,6 @@ if [ "x${BUILD_TYPE}" == "xintel" ]; then
export CMAKE_PREFIX_PATH="$(python -c 'import site; print(site.getsitepackages()[0])'):${CMAKE_PREFIX_PATH:-}" export CMAKE_PREFIX_PATH="$(python -c 'import site; print(site.getsitepackages()[0])'):${CMAKE_PREFIX_PATH:-}"
VLLM_TARGET_DEVICE=xpu uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} --no-deps . VLLM_TARGET_DEVICE=xpu uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} --no-deps .
popd popd
# L4T arm64 (JetPack 7): drive the install through pyproject.toml so that
# [tool.uv.sources] can pin torch/vllm/flash-attn/torchvision/torchaudio
# to the jetson-ai-lab index, while everything else (transitive deps and
# PyPI-resolvable packages like transformers) comes from PyPI. Bypasses
# installRequirements because uv pip install -r requirements.txt does not
# honor sources — see backend/python/vllm/pyproject.toml for the rationale.
elif [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
ensureVenv
if [ "x${PORTABLE_PYTHON}" == "xtrue" ]; then
export C_INCLUDE_PATH="${C_INCLUDE_PATH:-}:$(_portable_dir)/include/python${PYTHON_VERSION}"
fi
pushd "${backend_dir}"
# Build deps first (matches installRequirements' requirements-install.txt
# pass — fastsafetensors and friends need pybind11 in the venv before
# their sdists can build under --no-build-isolation).
uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} -r requirements-install.txt
uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} --requirement pyproject.toml
popd
runProtogen
# FROM_SOURCE=true on a CPU build skips the prebuilt vllm wheel in # FROM_SOURCE=true on a CPU build skips the prebuilt vllm wheel in
# requirements-cpu-after.txt and compiles vllm locally against the host's # requirements-cpu-after.txt and compiles vllm locally against the host's
# actual CPU. Not used by default because it takes ~30-40 minutes, but # actual CPU. Not used by default because it takes ~30-40 minutes, but

View File

@@ -1,61 +0,0 @@
# L4T arm64 (JetPack 7 / sbsa cu130) install spec for the vllm backend.
#
# Why this file exists, and why only the l4t13 BUILD_PROFILE consumes it:
#
# pypi.jetson-ai-lab.io hosts the L4T-specific torch / vllm / flash-attn
# wheels we need on aarch64 + cuda13, but it ALSO transparently proxies the
# rest of PyPI through `/+f/<sha>/<filename>` URLs that 503 frequently. With
# `--extra-index-url` + `--index-strategy=unsafe-best-match` (the historical
# fix in install.sh) uv would pick those proxy URLs for ordinary PyPI
# packages — `anthropic`, `openai`, `propcache`, `annotated-types` — and
# trip on the 503s. See e.g. CI run 25212201349 (anthropic-0.97.0).
#
# `explicit = true` on the index makes uv consult the L4T mirror ONLY for
# packages mapped under [tool.uv.sources]. Everything else goes to PyPI.
# This breaks the historical 503 path without losing access to the L4T
# wheels we actually need from there.
#
# `uv pip install -r requirements.txt` does NOT honor [tool.uv.sources]
# (sources are project-mode only, not pip-compat mode), so install.sh's
# l4t13 branch invokes `uv pip install --requirement pyproject.toml`
# directly. Other BUILD_PROFILEs continue to use the requirements-*.txt
# pipeline through libbackend.sh's installRequirements and never read
# this file.
[project]
name = "localai-vllm-l4t13"
version = "0.0.0"
requires-python = ">=3.12,<3.13"
dependencies = [
# Mirror of requirements.txt — kept in sync manually for now since the
# l4t13 path bypasses installRequirements (see install.sh).
"grpcio==1.80.0",
"protobuf",
"certifi",
"setuptools",
"pillow",
"charset-normalizer>=3.4.7",
"chardet",
# L4T-specific accelerator stack (sourced from jetson-ai-lab below).
"torch",
"torchvision",
"torchaudio",
"flash-attn",
"vllm",
# PyPI-resolvable packages that complete the runtime — accelerate,
# transformers, bitsandbytes carry their own wheels for aarch64.
"accelerate",
"transformers",
"bitsandbytes",
]
[[tool.uv.index]]
name = "jetson-ai-lab"
url = "https://pypi.jetson-ai-lab.io/sbsa/cu130"
explicit = true
[tool.uv.sources]
torch = { index = "jetson-ai-lab" }
torchvision = { index = "jetson-ai-lab" }
torchaudio = { index = "jetson-ai-lab" }
flash-attn = { index = "jetson-ai-lab" }
vllm = { index = "jetson-ai-lab" }

View File

@@ -3,5 +3,5 @@
# on a cu130 host. Pull the cu130-flavoured wheel from vLLM's per-tag index # on a cu130 host. Pull the cu130-flavoured wheel from vLLM's per-tag index
# instead — the cublas13 case in install.sh adds --index-strategy=unsafe-best-match # instead — the cublas13 case in install.sh adds --index-strategy=unsafe-best-match
# so uv consults this index alongside PyPI. # so uv consults this index alongside PyPI.
--extra-index-url https://wheels.vllm.ai/0.20.2/cu130 --extra-index-url https://wheels.vllm.ai/0.21.0/cu130
vllm==0.20.2 vllm==0.21.0

View File

@@ -0,0 +1,4 @@
# vLLM 0.20+ ships an aarch64 manylinux wheel on PyPI whose Requires-Dist pins
# torch==2.11.0 / torchvision==0.26.0 / torchaudio==2.11.0, locking an ABI-
# consistent set with the cu130 torch wheel installed above.
vllm

View File

@@ -0,0 +1,8 @@
# JetPack 7 / L4T arm64 + CUDA 13. Since PyTorch 2.11 (April 2026), PyPI ships
# aarch64 + cu130 manylinux wheels for torch/torchvision/torchaudio directly,
# so we no longer need a custom --extra-index-url for the L4T mirror.
# https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/
accelerate
torch
transformers
bitsandbytes

View File

@@ -169,7 +169,7 @@ func initDistributed(cfg *config.ApplicationConfig, authDB *gorm.DB, configLoade
cfg.Distributed.HealthCheckIntervalOrDefault(), cfg.Distributed.HealthCheckIntervalOrDefault(),
cfg.Distributed.StaleNodeThresholdOrDefault(), cfg.Distributed.StaleNodeThresholdOrDefault(),
routerAuthToken, routerAuthToken,
cfg.Distributed.PerModelHealthCheck, !cfg.Distributed.DisablePerModelHealthCheck,
) )
// Initialize job store // Initialize job store
@@ -233,7 +233,12 @@ func initDistributed(cfg *config.ApplicationConfig, authDB *gorm.DB, configLoade
xlog.Info("File stager initialized (HTTP direct transfer)") xlog.Info("File stager initialized (HTTP direct transfer)")
} }
// Create RemoteUnloaderAdapter — needed by SmartRouter and startup.go // Create RemoteUnloaderAdapter — needed by SmartRouter and startup.go
remoteUnloader := nodes.NewRemoteUnloaderAdapter(registry, natsClient) remoteUnloader := nodes.NewRemoteUnloaderAdapter(
registry,
natsClient,
cfg.Distributed.BackendInstallTimeoutOrDefault(),
cfg.Distributed.BackendUpgradeTimeoutOrDefault(),
)
// All dependencies ready — build SmartRouter with all options at once // All dependencies ready — build SmartRouter with all options at once
var conflictResolver nodes.ConcurrencyConflictResolver var conflictResolver nodes.ConcurrencyConflictResolver

View File

@@ -17,9 +17,9 @@ import (
"github.com/mudler/LocalAI/core/services/jobs" "github.com/mudler/LocalAI/core/services/jobs"
"github.com/mudler/LocalAI/core/services/nodes" "github.com/mudler/LocalAI/core/services/nodes"
"github.com/mudler/LocalAI/core/services/storage" "github.com/mudler/LocalAI/core/services/storage"
"github.com/mudler/LocalAI/pkg/vram"
coreStartup "github.com/mudler/LocalAI/core/startup" coreStartup "github.com/mudler/LocalAI/core/startup"
"github.com/mudler/LocalAI/internal" "github.com/mudler/LocalAI/internal"
"github.com/mudler/LocalAI/pkg/vram"
"github.com/mudler/LocalAI/pkg/model" "github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/LocalAI/pkg/sanitize" "github.com/mudler/LocalAI/pkg/sanitize"
@@ -200,7 +200,7 @@ func New(opts ...config.AppOption) (*Application, error) {
nodes.NewDistributedModelManager(options, application.modelLoader, distSvc.Unloader), nodes.NewDistributedModelManager(options, application.modelLoader, distSvc.Unloader),
) )
application.galleryService.SetBackendManager( application.galleryService.SetBackendManager(
nodes.NewDistributedBackendManager(options, application.modelLoader, distSvc.Unloader, distSvc.Registry), nodes.NewDistributedBackendManager(options, application.modelLoader, distSvc.Unloader, distSvc.Registry, application.galleryService),
) )
} }
} }
@@ -212,12 +212,12 @@ func New(opts ...config.AppOption) (*Application, error) {
} }
} }
if err := coreStartup.InstallModels(options.Context, application.GalleryService(), options.Galleries, options.BackendGalleries, options.SystemState, application.ModelLoader(), options.EnforcePredownloadScans, options.AutoloadBackendGalleries, nil, options.ModelsURL...); err != nil { if err := coreStartup.InstallModels(options.Context, application.GalleryService(), options.Galleries, options.BackendGalleries, options.SystemState, application.ModelLoader(), options.EnforcePredownloadScans, options.AutoloadBackendGalleries, options.RequireBackendIntegrity, nil, options.ModelsURL...); err != nil {
xlog.Error("error installing models", "error", err) xlog.Error("error installing models", "error", err)
} }
for _, backend := range options.ExternalBackends { for _, backend := range options.ExternalBackends {
if err := galleryop.InstallExternalBackend(options.Context, options.BackendGalleries, options.SystemState, application.ModelLoader(), nil, backend, "", ""); err != nil { if err := galleryop.InstallExternalBackend(options.Context, options.BackendGalleries, options.SystemState, application.ModelLoader(), nil, backend, "", "", options.RequireBackendIntegrity); err != nil {
xlog.Error("error installing external backend", "error", err) xlog.Error("error installing external backend", "error", err)
} }
} }
@@ -267,13 +267,13 @@ func New(opts ...config.AppOption) (*Application, error) {
} }
if options.PreloadJSONModels != "" { if options.PreloadJSONModels != "" {
if err := galleryop.ApplyGalleryFromString(options.SystemState, application.ModelLoader(), options.EnforcePredownloadScans, options.AutoloadBackendGalleries, options.Galleries, options.BackendGalleries, options.PreloadJSONModels); err != nil { if err := galleryop.ApplyGalleryFromString(options.SystemState, application.ModelLoader(), options.EnforcePredownloadScans, options.AutoloadBackendGalleries, options.Galleries, options.BackendGalleries, options.PreloadJSONModels, options.RequireBackendIntegrity); err != nil {
return nil, err return nil, err
} }
} }
if options.PreloadModelsFromPath != "" { if options.PreloadModelsFromPath != "" {
if err := galleryop.ApplyGalleryFromFile(options.SystemState, application.ModelLoader(), options.EnforcePredownloadScans, options.AutoloadBackendGalleries, options.Galleries, options.BackendGalleries, options.PreloadModelsFromPath); err != nil { if err := galleryop.ApplyGalleryFromFile(options.SystemState, application.ModelLoader(), options.EnforcePredownloadScans, options.AutoloadBackendGalleries, options.Galleries, options.BackendGalleries, options.PreloadModelsFromPath, options.RequireBackendIntegrity); err != nil {
return nil, err return nil, err
} }
} }
@@ -552,6 +552,13 @@ func loadRuntimeSettingsFromFile(options *config.ApplicationConfig) {
options.TracingMaxItems = *settings.TracingMaxItems options.TracingMaxItems = *settings.TracingMaxItems
} }
} }
if settings.TracingMaxBodyBytes != nil {
// Allow the on-disk setting to override the CLI/env default. The
// startup default is non-zero (see NewApplicationConfig), so a plain
// `== 0` guard like the others would never trigger; we instead respect
// any value the file specifies. 0 in the file means "uncapped".
options.TracingMaxBodyBytes = *settings.TracingMaxBodyBytes
}
// Branding / whitelabeling. There are no env vars for these — the file is // Branding / whitelabeling. There are no env vars for these — the file is
// the only source — so apply unconditionally. Without this block a server // the only source — so apply unconditionally. Without this block a server

View File

@@ -217,7 +217,7 @@ func (uc *UpgradeChecker) runCheck(ctx context.Context) {
err = bm.UpgradeBackend(ctx, name, nil) err = bm.UpgradeBackend(ctx, name, nil)
} else { } else {
err = gallery.UpgradeBackend(ctx, uc.systemState, uc.modelLoader, err = gallery.UpgradeBackend(ctx, uc.systemState, uc.modelLoader,
uc.galleries, name, nil) uc.galleries, name, nil, uc.appConfig.RequireBackendIntegrity)
} }
if err != nil { if err != nil {
xlog.Error("Failed to auto-upgrade backend", xlog.Error("Failed to auto-upgrade backend",

View File

@@ -86,7 +86,7 @@ func ModelInference(ctx context.Context, s string, messages schema.Messages, ima
if !slices.Contains(modelNames, modelName) { if !slices.Contains(modelNames, modelName) {
utils.ResetDownloadTimers() utils.ResetDownloadTimers()
// if we failed to load the model, we try to download it // if we failed to load the model, we try to download it
err := gallery.InstallModelFromGallery(ctx, o.Galleries, o.BackendGalleries, o.SystemState, loader, modelName, gallery.GalleryModel{}, utils.DisplayDownloadFunction, o.EnforcePredownloadScans, o.AutoloadBackendGalleries) err := gallery.InstallModelFromGallery(ctx, o.Galleries, o.BackendGalleries, o.SystemState, loader, modelName, gallery.GalleryModel{}, utils.DisplayDownloadFunction, o.EnforcePredownloadScans, o.AutoloadBackendGalleries, o.RequireBackendIntegrity)
if err != nil { if err != nil {
xlog.Error("failed to install model from gallery", "error", err, "model", modelFile) xlog.Error("failed to install model from gallery", "error", err, "model", modelFile)
//return nil, err //return nil, err

View File

@@ -277,7 +277,7 @@ func gRPCPredictOpts(c config.ModelConfig, modelPath string) *pb.PredictOptions
MinP: float32(*c.MinP), MinP: float32(*c.MinP),
Tokens: int32(*c.Maxtokens), Tokens: int32(*c.Maxtokens),
Threads: int32(*c.Threads), Threads: int32(*c.Threads),
PromptCacheAll: c.PromptCacheAll, PromptCacheAll: *c.PromptCacheAll,
PromptCacheRO: c.PromptCacheRO, PromptCacheRO: c.PromptCacheRO,
PromptCachePath: promptCachePath, PromptCachePath: promptCachePath,
F16KV: *c.F16, F16KV: *c.F16,

View File

@@ -17,9 +17,10 @@ import (
) )
type BackendsCMDFlags struct { type BackendsCMDFlags struct {
BackendGalleries string `env:"LOCALAI_BACKEND_GALLERIES,BACKEND_GALLERIES" help:"JSON list of backend galleries" group:"backends" default:"${backends}"` BackendGalleries string `env:"LOCALAI_BACKEND_GALLERIES,BACKEND_GALLERIES" help:"JSON list of backend galleries" group:"backends" default:"${backends}"`
BackendsPath string `env:"LOCALAI_BACKENDS_PATH,BACKENDS_PATH" type:"path" default:"${basepath}/backends" help:"Path containing backends used for inferencing" group:"storage"` BackendsPath string `env:"LOCALAI_BACKENDS_PATH,BACKENDS_PATH" type:"path" default:"${basepath}/backends" help:"Path containing backends used for inferencing" group:"storage"`
BackendsSystemPath string `env:"LOCALAI_BACKENDS_SYSTEM_PATH,BACKEND_SYSTEM_PATH" type:"path" default:"/var/lib/local-ai/backends" help:"Path containing system backends used for inferencing" group:"backends"` BackendsSystemPath string `env:"LOCALAI_BACKENDS_SYSTEM_PATH,BACKEND_SYSTEM_PATH" type:"path" default:"/var/lib/local-ai/backends" help:"Path containing system backends used for inferencing" group:"backends"`
RequireBackendIntegrity bool `env:"LOCALAI_REQUIRE_BACKEND_INTEGRITY,REQUIRE_BACKEND_INTEGRITY" help:"If true, reject backend installs without a configured signature verification policy (OCI URIs) or SHA256 (tarball/HTTP URIs)." group:"hardening" default:"false"`
} }
type BackendsList struct { type BackendsList struct {
@@ -126,7 +127,7 @@ func (bi *BackendsInstall) Run(ctx *cliContext.Context) error {
} }
modelLoader := model.NewModelLoader(systemState) modelLoader := model.NewModelLoader(systemState)
err = galleryop.InstallExternalBackend(context.Background(), galleries, systemState, modelLoader, progressCallback, bi.BackendArgs, bi.Name, bi.Alias) err = galleryop.InstallExternalBackend(context.Background(), galleries, systemState, modelLoader, progressCallback, bi.BackendArgs, bi.Name, bi.Alias, bi.RequireBackendIntegrity)
if err != nil { if err != nil {
return err return err
} }
@@ -197,7 +198,7 @@ func (bu *BackendsUpgrade) Run(ctx *cliContext.Context) error {
} }
} }
if err := gallery.UpgradeBackend(context.Background(), systemState, modelLoader, galleries, name, progressCallback); err != nil { if err := gallery.UpgradeBackend(context.Background(), systemState, modelLoader, galleries, name, progressCallback, bu.RequireBackendIntegrity); err != nil {
fmt.Printf("Failed to upgrade %s: %v\n", name, err) fmt.Printf("Failed to upgrade %s: %v\n", name, err)
} else { } else {
fmt.Printf("Backend %s upgraded successfully\n", name) fmt.Printf("Backend %s upgraded successfully\n", name)

View File

@@ -32,6 +32,7 @@ type ModelsList struct {
type ModelsInstall struct { type ModelsInstall struct {
DisablePredownloadScan bool `env:"LOCALAI_DISABLE_PREDOWNLOAD_SCAN" help:"If true, disables the best-effort security scanner before downloading any files." group:"hardening" default:"false"` DisablePredownloadScan bool `env:"LOCALAI_DISABLE_PREDOWNLOAD_SCAN" help:"If true, disables the best-effort security scanner before downloading any files." group:"hardening" default:"false"`
RequireBackendIntegrity bool `env:"LOCALAI_REQUIRE_BACKEND_INTEGRITY,REQUIRE_BACKEND_INTEGRITY" help:"If true, reject backend installs without a configured signature verification policy (OCI URIs) or SHA256 (tarball/HTTP URIs)." group:"hardening" default:"false"`
AutoloadBackendGalleries bool `env:"LOCALAI_AUTOLOAD_BACKEND_GALLERIES" help:"If true, automatically loads backend galleries" group:"backends" default:"true"` AutoloadBackendGalleries bool `env:"LOCALAI_AUTOLOAD_BACKEND_GALLERIES" help:"If true, automatically loads backend galleries" group:"backends" default:"true"`
ModelArgs []string `arg:"" optional:"" name:"models" help:"Model configuration URLs to load"` ModelArgs []string `arg:"" optional:"" name:"models" help:"Model configuration URLs to load"`
@@ -71,7 +72,6 @@ func (ml *ModelsList) Run(ctx *cliContext.Context) error {
} }
func (mi *ModelsInstall) Run(ctx *cliContext.Context) error { func (mi *ModelsInstall) Run(ctx *cliContext.Context) error {
systemState, err := system.GetSystemState( systemState, err := system.GetSystemState(
system.WithModelPath(mi.ModelsPath), system.WithModelPath(mi.ModelsPath),
system.WithBackendPath(mi.BackendsPath), system.WithBackendPath(mi.BackendsPath),
@@ -135,7 +135,7 @@ func (mi *ModelsInstall) Run(ctx *cliContext.Context) error {
} }
modelLoader := model.NewModelLoader(systemState) modelLoader := model.NewModelLoader(systemState)
err = startup.InstallModels(context.Background(), galleryService, galleries, backendGalleries, systemState, modelLoader, !mi.DisablePredownloadScan, mi.AutoloadBackendGalleries, progressCallback, modelName) err = startup.InstallModels(context.Background(), galleryService, galleries, backendGalleries, systemState, modelLoader, !mi.DisablePredownloadScan, mi.AutoloadBackendGalleries, mi.RequireBackendIntegrity, progressCallback, modelName)
if err != nil { if err != nil {
return err return err
} }

View File

@@ -39,19 +39,19 @@ type RunCMD struct {
LocalaiConfigDir string `env:"LOCALAI_CONFIG_DIR" type:"path" default:"${basepath}/configuration" help:"Directory for dynamic loading of certain configuration files (currently api_keys.json and external_backends.json)" group:"storage"` LocalaiConfigDir string `env:"LOCALAI_CONFIG_DIR" type:"path" default:"${basepath}/configuration" help:"Directory for dynamic loading of certain configuration files (currently api_keys.json and external_backends.json)" group:"storage"`
LocalaiConfigDirPollInterval time.Duration `env:"LOCALAI_CONFIG_DIR_POLL_INTERVAL" help:"Typically the config path picks up changes automatically, but if your system has broken fsnotify events, set this to an interval to poll the LocalAI Config Dir (example: 1m)" group:"storage"` LocalaiConfigDirPollInterval time.Duration `env:"LOCALAI_CONFIG_DIR_POLL_INTERVAL" help:"Typically the config path picks up changes automatically, but if your system has broken fsnotify events, set this to an interval to poll the LocalAI Config Dir (example: 1m)" group:"storage"`
// The alias on this option is there to preserve functionality with the old `--config-file` parameter // The alias on this option is there to preserve functionality with the old `--config-file` parameter
ModelsConfigFile string `env:"LOCALAI_MODELS_CONFIG_FILE,CONFIG_FILE" aliases:"config-file" help:"YAML file containing a list of model backend configs" group:"storage"` ModelsConfigFile string `env:"LOCALAI_MODELS_CONFIG_FILE,CONFIG_FILE" aliases:"config-file" help:"YAML file containing a list of model backend configs" group:"storage"`
BackendGalleries string `env:"LOCALAI_BACKEND_GALLERIES,BACKEND_GALLERIES" help:"JSON list of backend galleries" group:"backends" default:"${backends}"` BackendGalleries string `env:"LOCALAI_BACKEND_GALLERIES,BACKEND_GALLERIES" help:"JSON list of backend galleries" group:"backends" default:"${backends}"`
Galleries string `env:"LOCALAI_GALLERIES,GALLERIES" help:"JSON list of galleries" group:"models" default:"${galleries}"` Galleries string `env:"LOCALAI_GALLERIES,GALLERIES" help:"JSON list of galleries" group:"models" default:"${galleries}"`
AutoloadGalleries bool `env:"LOCALAI_AUTOLOAD_GALLERIES,AUTOLOAD_GALLERIES" group:"models" default:"true"` AutoloadGalleries bool `env:"LOCALAI_AUTOLOAD_GALLERIES,AUTOLOAD_GALLERIES" group:"models" default:"true"`
AutoloadBackendGalleries bool `env:"LOCALAI_AUTOLOAD_BACKEND_GALLERIES,AUTOLOAD_BACKEND_GALLERIES" group:"backends" default:"true"` AutoloadBackendGalleries bool `env:"LOCALAI_AUTOLOAD_BACKEND_GALLERIES,AUTOLOAD_BACKEND_GALLERIES" group:"backends" default:"true"`
BackendImagesReleaseTag string `env:"LOCALAI_BACKEND_IMAGES_RELEASE_TAG,BACKEND_IMAGES_RELEASE_TAG" help:"Fallback release tag for backend images" group:"backends" default:"latest"` BackendImagesReleaseTag string `env:"LOCALAI_BACKEND_IMAGES_RELEASE_TAG,BACKEND_IMAGES_RELEASE_TAG" help:"Fallback release tag for backend images" group:"backends" default:"latest"`
BackendImagesBranchTag string `env:"LOCALAI_BACKEND_IMAGES_BRANCH_TAG,BACKEND_IMAGES_BRANCH_TAG" help:"Fallback branch tag for backend images" group:"backends" default:"master"` BackendImagesBranchTag string `env:"LOCALAI_BACKEND_IMAGES_BRANCH_TAG,BACKEND_IMAGES_BRANCH_TAG" help:"Fallback branch tag for backend images" group:"backends" default:"master"`
BackendDevSuffix string `env:"LOCALAI_BACKEND_DEV_SUFFIX,BACKEND_DEV_SUFFIX" help:"Development suffix for backend images" group:"backends" default:"development"` BackendDevSuffix string `env:"LOCALAI_BACKEND_DEV_SUFFIX,BACKEND_DEV_SUFFIX" help:"Development suffix for backend images" group:"backends" default:"development"`
AutoUpgradeBackends bool `env:"LOCALAI_AUTO_UPGRADE_BACKENDS,AUTO_UPGRADE_BACKENDS" help:"Automatically upgrade backends when new versions are detected" group:"backends" default:"false"` AutoUpgradeBackends bool `env:"LOCALAI_AUTO_UPGRADE_BACKENDS,AUTO_UPGRADE_BACKENDS" help:"Automatically upgrade backends when new versions are detected" group:"backends" default:"false"`
PreferDevelopmentBackends bool `env:"LOCALAI_PREFER_DEV_BACKENDS,PREFER_DEV_BACKENDS" help:"Prefer development backend versions (shows development backends by default in UI)" group:"backends" default:"false"` PreferDevelopmentBackends bool `env:"LOCALAI_PREFER_DEV_BACKENDS,PREFER_DEV_BACKENDS" help:"Prefer development backend versions (shows development backends by default in UI)" group:"backends" default:"false"`
PreloadModels string `env:"LOCALAI_PRELOAD_MODELS,PRELOAD_MODELS" help:"A List of models to apply in JSON at start" group:"models"` PreloadModels string `env:"LOCALAI_PRELOAD_MODELS,PRELOAD_MODELS" help:"A List of models to apply in JSON at start" group:"models"`
Models []string `env:"LOCALAI_MODELS,MODELS" help:"A List of model configuration URLs to load" group:"models"` Models []string `env:"LOCALAI_MODELS,MODELS" help:"A List of model configuration URLs to load" group:"models"`
PreloadModelsConfig string `env:"LOCALAI_PRELOAD_MODELS_CONFIG,PRELOAD_MODELS_CONFIG" help:"A List of models to apply at startup. Path to a YAML config file" group:"models"` PreloadModelsConfig string `env:"LOCALAI_PRELOAD_MODELS_CONFIG,PRELOAD_MODELS_CONFIG" help:"A List of models to apply at startup. Path to a YAML config file" group:"models"`
F16 bool `name:"f16" env:"LOCALAI_F16,F16" help:"Enable GPU acceleration" group:"performance"` F16 bool `name:"f16" env:"LOCALAI_F16,F16" help:"Enable GPU acceleration" group:"performance"`
Threads int `env:"LOCALAI_THREADS,THREADS" short:"t" help:"Number of threads used for parallel computation. Usage of the number of physical cores in the system is suggested" group:"performance"` Threads int `env:"LOCALAI_THREADS,THREADS" short:"t" help:"Number of threads used for parallel computation. Usage of the number of physical cores in the system is suggested" group:"performance"`
@@ -67,6 +67,7 @@ type RunCMD struct {
OllamaAPIRootEndpoint bool `env:"LOCALAI_OLLAMA_API_ROOT_ENDPOINT" default:"false" help:"Register Ollama-compatible health check on / (replaces web UI on root path). The /api/* Ollama endpoints are always available regardless of this flag" group:"api"` OllamaAPIRootEndpoint bool `env:"LOCALAI_OLLAMA_API_ROOT_ENDPOINT" default:"false" help:"Register Ollama-compatible health check on / (replaces web UI on root path). The /api/* Ollama endpoints are always available regardless of this flag" group:"api"`
DisableRuntimeSettings bool `env:"LOCALAI_DISABLE_RUNTIME_SETTINGS,DISABLE_RUNTIME_SETTINGS" default:"false" help:"Disables the runtime settings. When set to true, the server will not load the runtime settings from the runtime_settings.json file" group:"api"` DisableRuntimeSettings bool `env:"LOCALAI_DISABLE_RUNTIME_SETTINGS,DISABLE_RUNTIME_SETTINGS" default:"false" help:"Disables the runtime settings. When set to true, the server will not load the runtime settings from the runtime_settings.json file" group:"api"`
DisablePredownloadScan bool `env:"LOCALAI_DISABLE_PREDOWNLOAD_SCAN" help:"If true, disables the best-effort security scanner before downloading any files." group:"hardening" default:"false"` DisablePredownloadScan bool `env:"LOCALAI_DISABLE_PREDOWNLOAD_SCAN" help:"If true, disables the best-effort security scanner before downloading any files." group:"hardening" default:"false"`
RequireBackendIntegrity bool `env:"LOCALAI_REQUIRE_BACKEND_INTEGRITY,REQUIRE_BACKEND_INTEGRITY" help:"If true, backend installs without a configured signature verification policy (for OCI URIs) or SHA256 (for tarball/HTTP URIs) are rejected. Default is to warn and install. Set this in production once your gallery's verification: block is populated." group:"hardening" default:"false"`
OpaqueErrors bool `env:"LOCALAI_OPAQUE_ERRORS" default:"false" help:"If true, all error responses are replaced with blank 500 errors. This is intended only for hardening against information leaks and is normally not recommended." group:"hardening"` OpaqueErrors bool `env:"LOCALAI_OPAQUE_ERRORS" default:"false" help:"If true, all error responses are replaced with blank 500 errors. This is intended only for hardening against information leaks and is normally not recommended." group:"hardening"`
UseSubtleKeyComparison bool `env:"LOCALAI_SUBTLE_KEY_COMPARISON" default:"false" help:"If true, API Key validation comparisons will be performed using constant-time comparisons rather than simple equality. This trades off performance on each request for resiliancy against timing attacks." group:"hardening"` UseSubtleKeyComparison bool `env:"LOCALAI_SUBTLE_KEY_COMPARISON" default:"false" help:"If true, API Key validation comparisons will be performed using constant-time comparisons rather than simple equality. This trades off performance on each request for resiliancy against timing attacks." group:"hardening"`
DisableApiKeyRequirementForHttpGet bool `env:"LOCALAI_DISABLE_API_KEY_REQUIREMENT_FOR_HTTP_GET" default:"false" help:"If true, a valid API key is not required to issue GET requests to portions of the web ui. This should only be enabled in secure testing environments" group:"hardening"` DisableApiKeyRequirementForHttpGet bool `env:"LOCALAI_DISABLE_API_KEY_REQUIREMENT_FOR_HTTP_GET" default:"false" help:"If true, a valid API key is not required to issue GET requests to portions of the web ui. This should only be enabled in secure testing environments" group:"hardening"`
@@ -99,6 +100,7 @@ type RunCMD struct {
LoadToMemory []string `env:"LOCALAI_LOAD_TO_MEMORY,LOAD_TO_MEMORY" help:"A list of models to load into memory at startup" group:"models"` LoadToMemory []string `env:"LOCALAI_LOAD_TO_MEMORY,LOAD_TO_MEMORY" help:"A list of models to load into memory at startup" group:"models"`
EnableTracing bool `env:"LOCALAI_ENABLE_TRACING,ENABLE_TRACING" help:"Enable API tracing" group:"api"` EnableTracing bool `env:"LOCALAI_ENABLE_TRACING,ENABLE_TRACING" help:"Enable API tracing" group:"api"`
TracingMaxItems int `env:"LOCALAI_TRACING_MAX_ITEMS" default:"1024" help:"Maximum number of traces to keep" group:"api"` TracingMaxItems int `env:"LOCALAI_TRACING_MAX_ITEMS" default:"1024" help:"Maximum number of traces to keep" group:"api"`
TracingMaxBodyBytes int `env:"LOCALAI_TRACING_MAX_BODY_BYTES" default:"65536" help:"Maximum bytes captured per request/response body in the trace buffer (0 = uncapped). Caps memory growth from chatty endpoints like /embeddings." group:"api"`
AgentJobRetentionDays int `env:"LOCALAI_AGENT_JOB_RETENTION_DAYS,AGENT_JOB_RETENTION_DAYS" default:"30" help:"Number of days to keep agent job history (default: 30)" group:"api"` AgentJobRetentionDays int `env:"LOCALAI_AGENT_JOB_RETENTION_DAYS,AGENT_JOB_RETENTION_DAYS" default:"30" help:"Number of days to keep agent job history (default: 30)" group:"api"`
OpenResponsesStoreTTL string `env:"LOCALAI_OPEN_RESPONSES_STORE_TTL,OPEN_RESPONSES_STORE_TTL" default:"0" help:"TTL for Open Responses store (e.g., 1h, 30m, 0 = no expiration)" group:"api"` OpenResponsesStoreTTL string `env:"LOCALAI_OPEN_RESPONSES_STORE_TTL,OPEN_RESPONSES_STORE_TTL" default:"0" help:"TTL for Open Responses store (e.g., 1h, 30m, 0 = no expiration)" group:"api"`
@@ -143,16 +145,18 @@ type RunCMD struct {
DefaultAPIKeyExpiry string `env:"LOCALAI_DEFAULT_API_KEY_EXPIRY" help:"Default expiry for API keys (e.g. 90d, 1y; empty = no expiry)" group:"auth"` DefaultAPIKeyExpiry string `env:"LOCALAI_DEFAULT_API_KEY_EXPIRY" help:"Default expiry for API keys (e.g. 90d, 1y; empty = no expiry)" group:"auth"`
// Distributed / Horizontal Scaling // Distributed / Horizontal Scaling
Distributed bool `env:"LOCALAI_DISTRIBUTED" default:"false" help:"Enable distributed mode (requires PostgreSQL + NATS)" group:"distributed"` Distributed bool `env:"LOCALAI_DISTRIBUTED" default:"false" help:"Enable distributed mode (requires PostgreSQL + NATS)" group:"distributed"`
InstanceID string `env:"LOCALAI_INSTANCE_ID" help:"Unique instance ID for distributed mode (auto-generated UUID if empty)" group:"distributed"` InstanceID string `env:"LOCALAI_INSTANCE_ID" help:"Unique instance ID for distributed mode (auto-generated UUID if empty)" group:"distributed"`
NatsURL string `env:"LOCALAI_NATS_URL" help:"NATS server URL (e.g., nats://localhost:4222)" group:"distributed"` NatsURL string `env:"LOCALAI_NATS_URL" help:"NATS server URL (e.g., nats://localhost:4222)" group:"distributed"`
StorageURL string `env:"LOCALAI_STORAGE_URL" help:"S3-compatible storage endpoint URL (e.g., http://minio:9000)" group:"distributed"` StorageURL string `env:"LOCALAI_STORAGE_URL" help:"S3-compatible storage endpoint URL (e.g., http://minio:9000)" group:"distributed"`
StorageBucket string `env:"LOCALAI_STORAGE_BUCKET" default:"localai" help:"S3 bucket name for object storage" group:"distributed"` StorageBucket string `env:"LOCALAI_STORAGE_BUCKET" default:"localai" help:"S3 bucket name for object storage" group:"distributed"`
StorageRegion string `env:"LOCALAI_STORAGE_REGION" default:"us-east-1" help:"S3 region" group:"distributed"` StorageRegion string `env:"LOCALAI_STORAGE_REGION" default:"us-east-1" help:"S3 region" group:"distributed"`
StorageAccessKey string `env:"LOCALAI_STORAGE_ACCESS_KEY" help:"S3 access key ID" group:"distributed"` StorageAccessKey string `env:"LOCALAI_STORAGE_ACCESS_KEY" help:"S3 access key ID" group:"distributed"`
StorageSecretKey string `env:"LOCALAI_STORAGE_SECRET_KEY" help:"S3 secret access key" group:"distributed"` StorageSecretKey string `env:"LOCALAI_STORAGE_SECRET_KEY" help:"S3 secret access key" group:"distributed"`
RegistrationToken string `env:"LOCALAI_REGISTRATION_TOKEN" help:"Token that backend nodes must provide to register (empty = no auth required)" group:"distributed"` RegistrationToken string `env:"LOCALAI_REGISTRATION_TOKEN" help:"Token that backend nodes must provide to register (empty = no auth required)" group:"distributed"`
AutoApproveNodes bool `env:"LOCALAI_AUTO_APPROVE_NODES" default:"false" help:"Auto-approve new worker nodes (skip admin approval)" group:"distributed"` AutoApproveNodes bool `env:"LOCALAI_AUTO_APPROVE_NODES" default:"false" help:"Auto-approve new worker nodes (skip admin approval)" group:"distributed"`
BackendInstallTimeout string `env:"LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT" help:"NATS round-trip timeout for backend.install requests sent to worker nodes (default 15m). Increase for slow links pulling multi-GB images." group:"distributed"`
BackendUpgradeTimeout string `env:"LOCALAI_NATS_BACKEND_UPGRADE_TIMEOUT" help:"NATS round-trip timeout for backend.upgrade requests (default 15m)." group:"distributed"`
Version bool Version bool
} }
@@ -253,6 +257,20 @@ func (r *RunCMD) Run(ctx *cliContext.Context) error {
if r.StorageSecretKey != "" { if r.StorageSecretKey != "" {
opts = append(opts, config.WithStorageSecretKey(r.StorageSecretKey)) opts = append(opts, config.WithStorageSecretKey(r.StorageSecretKey))
} }
if r.BackendInstallTimeout != "" {
d, err := time.ParseDuration(r.BackendInstallTimeout)
if err != nil {
return fmt.Errorf("invalid LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT %q: %w", r.BackendInstallTimeout, err)
}
opts = append(opts, config.WithBackendInstallTimeout(d))
}
if r.BackendUpgradeTimeout != "" {
d, err := time.ParseDuration(r.BackendUpgradeTimeout)
if err != nil {
return fmt.Errorf("invalid LOCALAI_NATS_BACKEND_UPGRADE_TIMEOUT %q: %w", r.BackendUpgradeTimeout, err)
}
opts = append(opts, config.WithBackendUpgradeTimeout(d))
}
if r.RegistrationToken != "" { if r.RegistrationToken != "" {
opts = append(opts, config.WithRegistrationToken(r.RegistrationToken)) opts = append(opts, config.WithRegistrationToken(r.RegistrationToken))
} }
@@ -272,6 +290,7 @@ func (r *RunCMD) Run(ctx *cliContext.Context) error {
opts = append(opts, config.EnableTracing) opts = append(opts, config.EnableTracing)
} }
opts = append(opts, config.WithTracingMaxItems(r.TracingMaxItems)) opts = append(opts, config.WithTracingMaxItems(r.TracingMaxItems))
opts = append(opts, config.WithTracingMaxBodyBytes(r.TracingMaxBodyBytes))
token := "" token := ""
if r.Peer2Peer || r.Peer2PeerToken != "" { if r.Peer2Peer || r.Peer2PeerToken != "" {
@@ -503,6 +522,10 @@ func (r *RunCMD) Run(ctx *cliContext.Context) error {
opts = append(opts, config.WithAutoUpgradeBackends(r.AutoUpgradeBackends)) opts = append(opts, config.WithAutoUpgradeBackends(r.AutoUpgradeBackends))
} }
if r.RequireBackendIntegrity {
opts = append(opts, config.WithRequireBackendIntegrity(r.RequireBackendIntegrity))
}
if r.PreferDevelopmentBackends { if r.PreferDevelopmentBackends {
opts = append(opts, config.WithPreferDevelopmentBackends(r.PreferDevelopmentBackends)) opts = append(opts, config.WithPreferDevelopmentBackends(r.PreferDevelopmentBackends))
} }

View File

@@ -1,10 +1,11 @@
package worker package worker
type WorkerFlags struct { type WorkerFlags struct {
BackendsPath string `env:"LOCALAI_BACKENDS_PATH,BACKENDS_PATH" type:"path" default:"${basepath}/backends" help:"Path containing backends used for inferencing" group:"backends"` BackendsPath string `env:"LOCALAI_BACKENDS_PATH,BACKENDS_PATH" type:"path" default:"${basepath}/backends" help:"Path containing backends used for inferencing" group:"backends"`
BackendGalleries string `env:"LOCALAI_BACKEND_GALLERIES,BACKEND_GALLERIES" help:"JSON list of backend galleries" group:"backends" default:"${backends}"` BackendGalleries string `env:"LOCALAI_BACKEND_GALLERIES,BACKEND_GALLERIES" help:"JSON list of backend galleries" group:"backends" default:"${backends}"`
BackendsSystemPath string `env:"LOCALAI_BACKENDS_SYSTEM_PATH,BACKEND_SYSTEM_PATH" type:"path" default:"/var/lib/local-ai/backends" help:"Path containing system backends used for inferencing" group:"backends"` BackendsSystemPath string `env:"LOCALAI_BACKENDS_SYSTEM_PATH,BACKEND_SYSTEM_PATH" type:"path" default:"/var/lib/local-ai/backends" help:"Path containing system backends used for inferencing" group:"backends"`
ExtraLLamaCPPArgs string `name:"llama-cpp-args" env:"LOCALAI_EXTRA_LLAMA_CPP_ARGS,EXTRA_LLAMA_CPP_ARGS" help:"Extra arguments to pass to llama-cpp-rpc-server"` RequireBackendIntegrity bool `env:"LOCALAI_REQUIRE_BACKEND_INTEGRITY,REQUIRE_BACKEND_INTEGRITY" help:"If true, reject backend installs without a configured signature verification policy (OCI URIs) or SHA256 (tarball/HTTP URIs)." group:"hardening" default:"false"`
ExtraLLamaCPPArgs string `name:"llama-cpp-args" env:"LOCALAI_EXTRA_LLAMA_CPP_ARGS,EXTRA_LLAMA_CPP_ARGS" help:"Extra arguments to pass to llama-cpp-rpc-server"`
} }
type Worker struct { type Worker struct {

View File

@@ -18,7 +18,7 @@ import (
// installing the backend from the gallery if it isn't present. // installing the backend from the gallery if it isn't present.
// `name` is the gallery entry name (for vLLM the meta entry "vllm" // `name` is the gallery entry name (for vLLM the meta entry "vllm"
// resolves to a platform-specific package via capability lookup). // resolves to a platform-specific package via capability lookup).
func findBackendPath(name, galleries string, systemState *system.SystemState) (string, error) { func findBackendPath(name, galleries string, systemState *system.SystemState, requireIntegrity bool) (string, error) {
backends, err := gallery.ListSystemBackends(systemState) backends, err := gallery.ListSystemBackends(systemState)
if err != nil { if err != nil {
return "", err return "", err
@@ -33,7 +33,7 @@ func findBackendPath(name, galleries string, systemState *system.SystemState) (s
xlog.Error("failed loading galleries", "error", err) xlog.Error("failed loading galleries", "error", err)
return "", err return "", err
} }
if err := gallery.InstallBackendFromGallery(context.Background(), gals, systemState, ml, name, nil, true); err != nil { if err := gallery.InstallBackendFromGallery(context.Background(), gals, systemState, ml, name, nil, true, requireIntegrity); err != nil {
xlog.Error("backend not found, failed to install it", "name", name, "error", err) xlog.Error("backend not found, failed to install it", "name", name, "error", err)
return "", err return "", err
} }

View File

@@ -27,7 +27,7 @@ const (
llamaCPPGalleryName = "llama-cpp" llamaCPPGalleryName = "llama-cpp"
) )
func findLLamaCPPBackend(galleries string, systemState *system.SystemState) (string, error) { func findLLamaCPPBackend(galleries string, systemState *system.SystemState, requireIntegrity bool) (string, error) {
backends, err := gallery.ListSystemBackends(systemState) backends, err := gallery.ListSystemBackends(systemState)
if err != nil { if err != nil {
xlog.Warn("Failed listing system backends", "error", err) xlog.Warn("Failed listing system backends", "error", err)
@@ -43,7 +43,7 @@ func findLLamaCPPBackend(galleries string, systemState *system.SystemState) (str
xlog.Error("failed loading galleries", "error", err) xlog.Error("failed loading galleries", "error", err)
return "", err return "", err
} }
err := gallery.InstallBackendFromGallery(context.Background(), gals, systemState, ml, llamaCPPGalleryName, nil, true) err := gallery.InstallBackendFromGallery(context.Background(), gals, systemState, ml, llamaCPPGalleryName, nil, true, requireIntegrity)
if err != nil { if err != nil {
xlog.Error("llama-cpp backend not found, failed to install it", "error", err) xlog.Error("llama-cpp backend not found, failed to install it", "error", err)
return "", err return "", err
@@ -76,7 +76,7 @@ func (r *LLamaCPP) Run(ctx *cliContext.Context) error {
if err != nil { if err != nil {
return err return err
} }
grpcProcess, err := findLLamaCPPBackend(r.BackendGalleries, systemState) grpcProcess, err := findLLamaCPPBackend(r.BackendGalleries, systemState, r.RequireBackendIntegrity)
if err != nil { if err != nil {
return err return err
} }

View File

@@ -9,8 +9,8 @@ import (
const mlxDistributedGalleryName = "mlx-distributed" const mlxDistributedGalleryName = "mlx-distributed"
func findMLXDistributedBackendPath(galleries string, systemState *system.SystemState) (string, error) { func findMLXDistributedBackendPath(galleries string, systemState *system.SystemState, requireIntegrity bool) (string, error) {
return findBackendPath(mlxDistributedGalleryName, galleries, systemState) return findBackendPath(mlxDistributedGalleryName, galleries, systemState, requireIntegrity)
} }
// buildMLXCommand builds the exec.Cmd to launch the mlx-distributed backend. // buildMLXCommand builds the exec.Cmd to launch the mlx-distributed backend.

View File

@@ -28,7 +28,7 @@ func (r *MLXDistributed) Run(ctx *cliContext.Context) error {
return err return err
} }
backendPath, err := findMLXDistributedBackendPath(r.BackendGalleries, systemState) backendPath, err := findMLXDistributedBackendPath(r.BackendGalleries, systemState, r.RequireBackendIntegrity)
if err != nil { if err != nil {
return fmt.Errorf("cannot find mlx-distributed backend: %w", err) return fmt.Errorf("cannot find mlx-distributed backend: %w", err)
} }

View File

@@ -73,7 +73,7 @@ func (r *P2P) Run(ctx *cliContext.Context) error {
for { for {
xlog.Info("Starting llama-cpp-rpc-server", "address", address, "port", port) xlog.Info("Starting llama-cpp-rpc-server", "address", address, "port", port)
grpcProcess, err := findLLamaCPPBackend(r.BackendGalleries, systemState) grpcProcess, err := findLLamaCPPBackend(r.BackendGalleries, systemState, r.RequireBackendIntegrity)
if err != nil { if err != nil {
xlog.Error("Failed to find llama-cpp-rpc-server", "error", err) xlog.Error("Failed to find llama-cpp-rpc-server", "error", err)
return return

View File

@@ -48,7 +48,7 @@ func (r *P2PMLX) Run(ctx *cliContext.Context) error {
c, cancel := context.WithCancel(context.Background()) c, cancel := context.WithCancel(context.Background())
defer cancel() defer cancel()
backendPath, err := findMLXDistributedBackendPath(r.BackendGalleries, systemState) backendPath, err := findMLXDistributedBackendPath(r.BackendGalleries, systemState, r.RequireBackendIntegrity)
if err != nil { if err != nil {
xlog.Warn("Could not find mlx-distributed backend from gallery, will try backend.py directly", "error", err) xlog.Warn("Could not find mlx-distributed backend from gallery, will try backend.py directly", "error", err)
} }

View File

@@ -77,7 +77,7 @@ func (r *VLLMDistributed) Run(ctx *cliContext.Context) error {
return fmt.Errorf("getting system state: %w", err) return fmt.Errorf("getting system state: %w", err)
} }
backendPath, err := findBackendPath("vllm", r.BackendGalleries, systemState) backendPath, err := findBackendPath("vllm", r.BackendGalleries, systemState, r.RequireBackendIntegrity)
if err != nil { if err != nil {
return fmt.Errorf("cannot find vllm backend: %w", err) return fmt.Errorf("cannot find vllm backend: %w", err)
} }

View File

@@ -21,6 +21,7 @@ type ApplicationConfig struct {
Debug bool Debug bool
EnableTracing bool EnableTracing bool
TracingMaxItems int TracingMaxItems int
TracingMaxBodyBytes int // Per-body cap for captured request/response bodies; 0 disables the cap
EnableBackendLogging bool EnableBackendLogging bool
GeneratedContentDir string GeneratedContentDir string
@@ -60,6 +61,13 @@ type ApplicationConfig struct {
AutoUpgradeBackends bool AutoUpgradeBackends bool
PreferDevelopmentBackends bool PreferDevelopmentBackends bool
// RequireBackendIntegrity promotes a missing SHA256 (tarball/HTTP URIs)
// or missing verification policy (OCI URIs) from a warning to a hard
// failure during backend install/upgrade. Off by default to keep
// upgrades non-breaking; operators opt in explicitly via
// --require-backend-integrity / LOCALAI_REQUIRE_BACKEND_INTEGRITY.
RequireBackendIntegrity bool
SingleBackend bool // Deprecated: use MaxActiveBackends = 1 instead SingleBackend bool // Deprecated: use MaxActiveBackends = 1 instead
MaxActiveBackends int // Maximum number of active backends (0 = unlimited, 1 = single backend mode) MaxActiveBackends int // Maximum number of active backends (0 = unlimited, 1 = single backend mode)
WatchDogIdle bool WatchDogIdle bool
@@ -180,6 +188,7 @@ func NewApplicationConfig(o ...AppOption) *ApplicationConfig {
LRUEvictionRetryInterval: 1 * time.Second, // Default: 1 second LRUEvictionRetryInterval: 1 * time.Second, // Default: 1 second
WatchDogInterval: 500 * time.Millisecond, // Default: 500ms WatchDogInterval: 500 * time.Millisecond, // Default: 500ms
TracingMaxItems: 1024, TracingMaxItems: 1024,
TracingMaxBodyBytes: 64 * 1024, // 64 KiB - caps each request/response body in the trace buffer
AgentPool: AgentPoolConfig{ AgentPool: AgentPoolConfig{
Enabled: true, Enabled: true,
Timeout: "5m", Timeout: "5m",
@@ -436,6 +445,10 @@ func WithAutoUpgradeBackends(v bool) AppOption {
return func(o *ApplicationConfig) { o.AutoUpgradeBackends = v } return func(o *ApplicationConfig) { o.AutoUpgradeBackends = v }
} }
func WithRequireBackendIntegrity(v bool) AppOption {
return func(o *ApplicationConfig) { o.RequireBackendIntegrity = v }
}
func WithPreferDevelopmentBackends(v bool) AppOption { func WithPreferDevelopmentBackends(v bool) AppOption {
return func(o *ApplicationConfig) { o.PreferDevelopmentBackends = v } return func(o *ApplicationConfig) { o.PreferDevelopmentBackends = v }
} }
@@ -567,6 +580,12 @@ func WithTracingMaxItems(items int) AppOption {
} }
} }
func WithTracingMaxBodyBytes(bytes int) AppOption {
return func(o *ApplicationConfig) {
o.TracingMaxBodyBytes = bytes
}
}
func WithGeneratedContentDir(generatedContentDir string) AppOption { func WithGeneratedContentDir(generatedContentDir string) AppOption {
return func(o *ApplicationConfig) { return func(o *ApplicationConfig) {
o.GeneratedContentDir = generatedContentDir o.GeneratedContentDir = generatedContentDir
@@ -909,6 +928,7 @@ func (o *ApplicationConfig) ToRuntimeSettings() RuntimeSettings {
f16 := o.F16 f16 := o.F16
debug := o.Debug debug := o.Debug
tracingMaxItems := o.TracingMaxItems tracingMaxItems := o.TracingMaxItems
tracingMaxBodyBytes := o.TracingMaxBodyBytes
enableTracing := o.EnableTracing enableTracing := o.EnableTracing
enableBackendLogging := o.EnableBackendLogging enableBackendLogging := o.EnableBackendLogging
cors := o.CORS cors := o.CORS
@@ -997,6 +1017,7 @@ func (o *ApplicationConfig) ToRuntimeSettings() RuntimeSettings {
F16: &f16, F16: &f16,
Debug: &debug, Debug: &debug,
TracingMaxItems: &tracingMaxItems, TracingMaxItems: &tracingMaxItems,
TracingMaxBodyBytes: &tracingMaxBodyBytes,
EnableTracing: &enableTracing, EnableTracing: &enableTracing,
EnableBackendLogging: &enableBackendLogging, EnableBackendLogging: &enableBackendLogging,
CORS: &cors, CORS: &cors,
@@ -1135,6 +1156,9 @@ func (o *ApplicationConfig) ApplyRuntimeSettings(settings *RuntimeSettings) (req
if settings.TracingMaxItems != nil { if settings.TracingMaxItems != nil {
o.TracingMaxItems = *settings.TracingMaxItems o.TracingMaxItems = *settings.TracingMaxItems
} }
if settings.TracingMaxBodyBytes != nil {
o.TracingMaxBodyBytes = *settings.TracingMaxBodyBytes
}
if settings.EnableBackendLogging != nil { if settings.EnableBackendLogging != nil {
o.EnableBackendLogging = *settings.EnableBackendLogging o.EnableBackendLogging = *settings.EnableBackendLogging
} }

View File

@@ -24,6 +24,7 @@ const (
UsecaseVAD = "vad" UsecaseVAD = "vad"
UsecaseAudioTransform = "audio_transform" UsecaseAudioTransform = "audio_transform"
UsecaseDiarization = "diarization" UsecaseDiarization = "diarization"
UsecaseRealtimeAudio = "realtime_audio"
) )
// GRPCMethod identifies a Backend service RPC from backend.proto. // GRPCMethod identifies a Backend service RPC from backend.proto.
@@ -45,6 +46,7 @@ const (
MethodVAD GRPCMethod = "VAD" MethodVAD GRPCMethod = "VAD"
MethodAudioTransform GRPCMethod = "AudioTransform" MethodAudioTransform GRPCMethod = "AudioTransform"
MethodDiarize GRPCMethod = "Diarize" MethodDiarize GRPCMethod = "Diarize"
MethodAudioToAudioStream GRPCMethod = "AudioToAudioStream"
) )
// UsecaseInfo describes a single known_usecase value and how it maps // UsecaseInfo describes a single known_usecase value and how it maps
@@ -147,6 +149,11 @@ var UsecaseInfoMap = map[string]UsecaseInfo{
GRPCMethod: MethodDiarize, GRPCMethod: MethodDiarize,
Description: "Speaker diarization (who-spoke-when, per-speaker segments) via the Diarize RPC.", Description: "Speaker diarization (who-spoke-when, per-speaker segments) via the Diarize RPC.",
}, },
UsecaseRealtimeAudio: {
Flag: FLAG_REALTIME_AUDIO,
GRPCMethod: MethodAudioToAudioStream,
Description: "Self-contained any-to-any audio model for the Realtime API — accepts microphone audio and emits speech + transcript (+ optional function calls) from a single backend via the AudioToAudioStream RPC.",
},
} }
// BackendCapability describes which gRPC methods and usecases a backend supports. // BackendCapability describes which gRPC methods and usecases a backend supports.
@@ -397,6 +404,15 @@ var BackendCapabilities = map[string]BackendCapability{
Description: "Meta MusicGen via transformers — music generation from text", Description: "Meta MusicGen via transformers — music generation from text",
}, },
// --- Any-to-any audio backends ---
"liquid-audio": {
GRPCMethods: []GRPCMethod{MethodPredict, MethodPredictStream, MethodAudioTranscription, MethodTTS, MethodAudioToAudioStream, MethodVAD},
PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseTranscript, UsecaseTTS, UsecaseRealtimeAudio, UsecaseVAD},
DefaultUsecases: []string{UsecaseRealtimeAudio, UsecaseChat, UsecaseTranscript, UsecaseTTS, UsecaseVAD},
AcceptsAudios: true,
Description: "LFM2 / LFM2.5-Audio — self-contained any-to-any audio model for the Realtime API; also exposes chat, transcription, TTS and a stub energy-based VAD endpoint",
},
// --- Audio transform backends --- // --- Audio transform backends ---
"localvqe": { "localvqe": {
GRPCMethods: []GRPCMethod{MethodAudioTransform}, GRPCMethods: []GRPCMethod{MethodAudioTransform},

View File

@@ -31,8 +31,19 @@ type DistributedConfig struct {
DrainTimeout time.Duration // Time to wait for in-flight requests during drain (default 30s) DrainTimeout time.Duration // Time to wait for in-flight requests during drain (default 30s)
HealthCheckInterval time.Duration // Health monitor check interval (default 15s) HealthCheckInterval time.Duration // Health monitor check interval (default 15s)
StaleNodeThreshold time.Duration // Time before a node is considered stale (default 60s) StaleNodeThreshold time.Duration // Time before a node is considered stale (default 60s)
PerModelHealthCheck bool // Enable per-model backend health checking (default false) // DisablePerModelHealthCheck turns off the health monitor's per-model
MCPCIJobTimeout time.Duration // MCP CI job execution timeout (default 10m) // gRPC probe. When enabled (the default), the monitor pings each model's
// gRPC address and removes stale node_models rows whose backend has
// crashed even though the worker's node-level heartbeat is still arriving.
// Without per-model probing, /embeddings and /completions can be dispatched
// to a backend that silently returns garbage (see also the cascading
// model-row cleanup on MarkUnhealthy / MarkDraining).
DisablePerModelHealthCheck bool
MCPCIJobTimeout time.Duration // MCP CI job execution timeout (default 10m)
BackendInstallTimeout time.Duration // NATS round-trip timeout for backend.install (default 15m)
BackendUpgradeTimeout time.Duration // NATS round-trip timeout for backend.upgrade (default 15m)
MaxUploadSize int64 // Maximum upload body size in bytes (default 50 GB) MaxUploadSize int64 // Maximum upload body size in bytes (default 50 GB)
@@ -60,13 +71,15 @@ func (c DistributedConfig) Validate() error {
} }
// Check for negative durations // Check for negative durations
for name, d := range map[string]time.Duration{ for name, d := range map[string]time.Duration{
"mcp-tool-timeout": c.MCPToolTimeout, FlagMCPToolTimeout: c.MCPToolTimeout,
"mcp-discovery-timeout": c.MCPDiscoveryTimeout, FlagMCPDiscoveryTimeout: c.MCPDiscoveryTimeout,
"worker-wait-timeout": c.WorkerWaitTimeout, FlagWorkerWaitTimeout: c.WorkerWaitTimeout,
"drain-timeout": c.DrainTimeout, FlagDrainTimeout: c.DrainTimeout,
"health-check-interval": c.HealthCheckInterval, FlagHealthCheckInterval: c.HealthCheckInterval,
"stale-node-threshold": c.StaleNodeThreshold, FlagStaleNodeThreshold: c.StaleNodeThreshold,
"mcp-ci-job-timeout": c.MCPCIJobTimeout, FlagMCPCIJobTimeout: c.MCPCIJobTimeout,
FlagBackendInstallTimeout: c.BackendInstallTimeout,
FlagBackendUpgradeTimeout: c.BackendUpgradeTimeout,
} { } {
if d < 0 { if d < 0 {
return fmt.Errorf("%s must not be negative", name) return fmt.Errorf("%s must not be negative", name)
@@ -129,24 +142,66 @@ func WithStorageSecretKey(key string) AppOption {
} }
} }
func WithBackendInstallTimeout(d time.Duration) AppOption {
return func(o *ApplicationConfig) {
o.Distributed.BackendInstallTimeout = d
}
}
func WithBackendUpgradeTimeout(d time.Duration) AppOption {
return func(o *ApplicationConfig) {
o.Distributed.BackendUpgradeTimeout = d
}
}
var EnableAutoApproveNodes = func(o *ApplicationConfig) { var EnableAutoApproveNodes = func(o *ApplicationConfig) {
o.Distributed.AutoApproveNodes = true o.Distributed.AutoApproveNodes = true
} }
// Flag names for distributed timeout / interval configuration. These are
// the kebab-case identifiers kong derives from the matching RunCMD struct
// fields; they appear in Validate error messages and any other operator-
// facing surface that needs to reference a specific knob by name. Keeping
// them as constants prevents the string from drifting from the actual
// flag a future rename would produce.
const (
FlagMCPToolTimeout = "mcp-tool-timeout"
FlagMCPDiscoveryTimeout = "mcp-discovery-timeout"
FlagWorkerWaitTimeout = "worker-wait-timeout"
FlagDrainTimeout = "drain-timeout"
FlagHealthCheckInterval = "health-check-interval"
FlagStaleNodeThreshold = "stale-node-threshold"
FlagMCPCIJobTimeout = "mcp-ci-job-timeout"
FlagBackendInstallTimeout = "backend-install-timeout"
FlagBackendUpgradeTimeout = "backend-upgrade-timeout"
)
// Defaults for distributed timeouts. // Defaults for distributed timeouts.
const ( const (
DefaultMCPToolTimeout = 360 * time.Second DefaultMCPToolTimeout = 360 * time.Second
DefaultMCPDiscoveryTimeout = 60 * time.Second DefaultMCPDiscoveryTimeout = 60 * time.Second
DefaultWorkerWaitTimeout = 5 * time.Minute DefaultWorkerWaitTimeout = 5 * time.Minute
DefaultDrainTimeout = 30 * time.Second DefaultDrainTimeout = 30 * time.Second
DefaultHealthCheckInterval = 15 * time.Second DefaultHealthCheckInterval = 15 * time.Second
DefaultStaleNodeThreshold = 60 * time.Second DefaultStaleNodeThreshold = 60 * time.Second
DefaultMCPCIJobTimeout = 10 * time.Minute DefaultMCPCIJobTimeout = 10 * time.Minute
DefaultBackendInstallTimeout = 15 * time.Minute
DefaultBackendUpgradeTimeout = 15 * time.Minute
) )
// DefaultMaxUploadSize is the default maximum upload body size (50 GB). // DefaultMaxUploadSize is the default maximum upload body size (50 GB).
const DefaultMaxUploadSize int64 = 50 << 30 const DefaultMaxUploadSize int64 = 50 << 30
// BackendInstallTimeoutOrDefault returns the configured timeout or the default.
func (c DistributedConfig) BackendInstallTimeoutOrDefault() time.Duration {
return cmp.Or(c.BackendInstallTimeout, DefaultBackendInstallTimeout)
}
// BackendUpgradeTimeoutOrDefault returns the configured timeout or the default.
func (c DistributedConfig) BackendUpgradeTimeoutOrDefault() time.Duration {
return cmp.Or(c.BackendUpgradeTimeout, DefaultBackendUpgradeTimeout)
}
// MCPToolTimeoutOrDefault returns the configured timeout or the default. // MCPToolTimeoutOrDefault returns the configured timeout or the default.
func (c DistributedConfig) MCPToolTimeoutOrDefault() time.Duration { func (c DistributedConfig) MCPToolTimeoutOrDefault() time.Duration {
return cmp.Or(c.MCPToolTimeout, DefaultMCPToolTimeout) return cmp.Or(c.MCPToolTimeout, DefaultMCPToolTimeout)

View File

@@ -0,0 +1,90 @@
package config_test
import (
"time"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"github.com/mudler/LocalAI/core/config"
)
var _ = Describe("DistributedConfig backend NATS timeouts", func() {
Context("BackendInstallTimeoutOrDefault", func() {
It("returns 15 minutes when unset", func() {
c := config.DistributedConfig{}
Expect(c.BackendInstallTimeoutOrDefault()).To(Equal(15 * time.Minute))
})
It("returns the configured value when set", func() {
c := config.DistributedConfig{BackendInstallTimeout: 42 * time.Minute}
Expect(c.BackendInstallTimeoutOrDefault()).To(Equal(42 * time.Minute))
})
})
Context("BackendUpgradeTimeoutOrDefault", func() {
It("returns 15 minutes when unset", func() {
c := config.DistributedConfig{}
Expect(c.BackendUpgradeTimeoutOrDefault()).To(Equal(15 * time.Minute))
})
It("returns the configured value when set", func() {
c := config.DistributedConfig{BackendUpgradeTimeout: 30 * time.Minute}
Expect(c.BackendUpgradeTimeoutOrDefault()).To(Equal(30 * time.Minute))
})
})
})
var _ = Describe("DistributedConfig flag-name constants", func() {
// Pin the kebab-case strings so a rename of the Go field name (or a
// CLI flag naming convention change) forces the constant to update,
// keeping the Validate error messages and any future operator-facing
// surface in sync with the actual CLI flag.
DescribeTable("flag name constants",
func(actual, expected string) {
Expect(actual).To(Equal(expected))
},
Entry("MCP tool timeout", config.FlagMCPToolTimeout, "mcp-tool-timeout"),
Entry("MCP discovery timeout", config.FlagMCPDiscoveryTimeout, "mcp-discovery-timeout"),
Entry("worker wait timeout", config.FlagWorkerWaitTimeout, "worker-wait-timeout"),
Entry("drain timeout", config.FlagDrainTimeout, "drain-timeout"),
Entry("health check interval", config.FlagHealthCheckInterval, "health-check-interval"),
Entry("stale node threshold", config.FlagStaleNodeThreshold, "stale-node-threshold"),
Entry("MCP CI job timeout", config.FlagMCPCIJobTimeout, "mcp-ci-job-timeout"),
Entry("backend install timeout", config.FlagBackendInstallTimeout, "backend-install-timeout"),
Entry("backend upgrade timeout", config.FlagBackendUpgradeTimeout, "backend-upgrade-timeout"),
)
})
var _ = Describe("DistributedConfig.Validate negative-duration errors", func() {
It("rejects a negative BackendInstallTimeout with the flag name in the error", func() {
c := config.DistributedConfig{
Enabled: true,
NatsURL: "nats://localhost:4222",
BackendInstallTimeout: -1 * time.Second,
}
err := c.Validate()
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring(config.FlagBackendInstallTimeout))
Expect(err.Error()).To(ContainSubstring("must not be negative"))
})
It("rejects a negative BackendUpgradeTimeout with the flag name in the error", func() {
c := config.DistributedConfig{
Enabled: true,
NatsURL: "nats://localhost:4222",
BackendUpgradeTimeout: -1 * time.Second,
}
err := c.Validate()
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring(config.FlagBackendUpgradeTimeout))
})
It("accepts all-zero durations as valid (defaults apply)", func() {
c := config.DistributedConfig{
Enabled: true,
NatsURL: "nats://localhost:4222",
}
Expect(c.Validate()).To(Succeed())
})
})

View File

@@ -1,6 +1,37 @@
package config package config
type Gallery struct { // GalleryVerification declares the keyless-cosign signature policy that
URL string `json:"url" yaml:"url"` // every OCI backend image fetched from this gallery must satisfy.
Name string `json:"name" yaml:"name"` //
// Verification is opt-in: galleries without a Verification block install
// backends with no signature check (the downloader logs a warning when
// LOCALAI_REQUIRE_BACKEND_INTEGRITY is unset; that flag turns the warning
// into a hard error).
//
// Identity matching: set Issuer (exact) or IssuerRegex, AND Identity
// (exact) or IdentityRegex. For GitHub Actions keyless signing the
// typical shape is:
//
// verification:
// issuer: "https://token.actions.githubusercontent.com"
// identity_regex: "^https://github\\.com/mudler/local-ai-backends/\\.github/workflows/build\\.yaml@refs/heads/master$"
// not_before: "2026-05-01T00:00:00Z"
//
// NotBefore is the revocation lever: advance it to invalidate every
// signature produced before a known compromise window. Keyless cosign
// certs are ephemeral so there is no CA-side revocation.
type GalleryVerification struct {
Issuer string `json:"issuer,omitempty" yaml:"issuer,omitempty"`
IssuerRegex string `json:"issuer_regex,omitempty" yaml:"issuer_regex,omitempty"`
Identity string `json:"identity,omitempty" yaml:"identity,omitempty"`
IdentityRegex string `json:"identity_regex,omitempty" yaml:"identity_regex,omitempty"`
// NotBefore is an RFC3339 timestamp. Empty disables the time check.
NotBefore string `json:"not_before,omitempty" yaml:"not_before,omitempty"`
}
type Gallery struct {
URL string `json:"url" yaml:"url"`
Name string `json:"name" yaml:"name"`
Verification *GalleryVerification `json:"verification,omitempty" yaml:"verification,omitempty"`
} }

View File

@@ -54,6 +54,13 @@ func guessGGUFFromFile(cfg *ModelConfig, f *gguf.GGUFFile, defaultCtx int) {
cfg.modelTemplate = chatTemplate.ValueString() cfg.modelTemplate = chatTemplate.ValueString()
} }
// Auto-enable Multi-Token Prediction (ggml-org/llama.cpp#22673) when the
// GGUF carries an embedded MTP head. Skipped silently for non-MTP models
// and when the user already configured a spec_type.
if n, ok := HasEmbeddedMTPHead(f); ok {
ApplyMTPDefaults(cfg, n)
}
// Thinking support detection is done after model load via DetectThinkingSupportFromBackend // Thinking support detection is done after model load via DetectThinkingSupportFromBackend
// template estimations // template estimations

View File

@@ -136,4 +136,36 @@ var _ = Describe("Backend hooks and parser defaults", func() {
Expect(cfg.EngineArgs["enable_chunked_prefill"]).To(Equal(true)) Expect(cfg.EngineArgs["enable_chunked_prefill"]).To(Equal(true))
}) })
}) })
Context("PromptCacheAll default", func() {
It("defaults to true when omitted from YAML", func() {
cfg := &ModelConfig{}
cfg.SetDefaults()
Expect(cfg.PromptCacheAll).NotTo(BeNil())
Expect(*cfg.PromptCacheAll).To(BeTrue())
})
It("preserves an explicit false from YAML", func() {
falseV := false
cfg := &ModelConfig{
LLMConfig: LLMConfig{PromptCacheAll: &falseV},
}
cfg.SetDefaults()
Expect(cfg.PromptCacheAll).NotTo(BeNil())
Expect(*cfg.PromptCacheAll).To(BeFalse())
})
It("preserves an explicit true from YAML", func() {
trueV := true
cfg := &ModelConfig{
LLMConfig: LLMConfig{PromptCacheAll: &trueV},
}
cfg.SetDefaults()
Expect(cfg.PromptCacheAll).NotTo(BeNil())
Expect(*cfg.PromptCacheAll).To(BeTrue())
})
})
}) })

View File

@@ -209,7 +209,7 @@ type LLMConfig struct {
RMSNormEps float32 `yaml:"rms_norm_eps,omitempty" json:"rms_norm_eps,omitempty"` RMSNormEps float32 `yaml:"rms_norm_eps,omitempty" json:"rms_norm_eps,omitempty"`
NGQA int32 `yaml:"ngqa,omitempty" json:"ngqa,omitempty"` NGQA int32 `yaml:"ngqa,omitempty" json:"ngqa,omitempty"`
PromptCachePath string `yaml:"prompt_cache_path,omitempty" json:"prompt_cache_path,omitempty"` PromptCachePath string `yaml:"prompt_cache_path,omitempty" json:"prompt_cache_path,omitempty"`
PromptCacheAll bool `yaml:"prompt_cache_all,omitempty" json:"prompt_cache_all,omitempty"` PromptCacheAll *bool `yaml:"prompt_cache_all,omitempty" json:"prompt_cache_all,omitempty"`
PromptCacheRO bool `yaml:"prompt_cache_ro,omitempty" json:"prompt_cache_ro,omitempty"` PromptCacheRO bool `yaml:"prompt_cache_ro,omitempty" json:"prompt_cache_ro,omitempty"`
MirostatETA *float64 `yaml:"mirostat_eta,omitempty" json:"mirostat_eta,omitempty"` MirostatETA *float64 `yaml:"mirostat_eta,omitempty" json:"mirostat_eta,omitempty"`
MirostatTAU *float64 `yaml:"mirostat_tau,omitempty" json:"mirostat_tau,omitempty"` MirostatTAU *float64 `yaml:"mirostat_tau,omitempty" json:"mirostat_tau,omitempty"`
@@ -494,6 +494,13 @@ func (cfg *ModelConfig) SetDefaults(opts ...ConfigLoaderOption) {
cfg.Reranking = &falseV cfg.Reranking = &falseV
} }
if cfg.PromptCacheAll == nil {
// Match upstream llama.cpp's default (common/common.h: cache_prompt = true)
// and let cache_idle_slots / kv_unified actually do useful work; users can
// opt out with an explicit `prompt_cache_all: false` in the model YAML.
cfg.PromptCacheAll = &trueV
}
if threads == 0 { if threads == 0 {
// Threads can't be 0 // Threads can't be 0
threads = 4 threads = 4
@@ -636,6 +643,7 @@ const (
FLAG_SPEAKER_RECOGNITION ModelConfigUsecase = 0b1000000000000000 FLAG_SPEAKER_RECOGNITION ModelConfigUsecase = 0b1000000000000000
FLAG_AUDIO_TRANSFORM ModelConfigUsecase = 0b10000000000000000 FLAG_AUDIO_TRANSFORM ModelConfigUsecase = 0b10000000000000000
FLAG_DIARIZATION ModelConfigUsecase = 0b100000000000000000 FLAG_DIARIZATION ModelConfigUsecase = 0b100000000000000000
FLAG_REALTIME_AUDIO ModelConfigUsecase = 0b1000000000000000000
// Common Subsets // Common Subsets
FLAG_LLM ModelConfigUsecase = FLAG_CHAT | FLAG_COMPLETION | FLAG_EDIT FLAG_LLM ModelConfigUsecase = FLAG_CHAT | FLAG_COMPLETION | FLAG_EDIT
@@ -645,12 +653,12 @@ const (
// Flags within the same group are NOT orthogonal (e.g., chat and completion are // Flags within the same group are NOT orthogonal (e.g., chat and completion are
// both text/language). A model is multimodal when its usecases span 2+ groups. // both text/language). A model is multimodal when its usecases span 2+ groups.
var ModalityGroups = []ModelConfigUsecase{ var ModalityGroups = []ModelConfigUsecase{
FLAG_CHAT | FLAG_COMPLETION | FLAG_EDIT, // text/language FLAG_CHAT | FLAG_COMPLETION | FLAG_EDIT, // text/language
FLAG_VISION | FLAG_DETECTION, // visual understanding FLAG_VISION | FLAG_DETECTION, // visual understanding
FLAG_TRANSCRIPT, // speech input FLAG_TRANSCRIPT | FLAG_REALTIME_AUDIO, // speech input — realtime_audio is any-to-any, so it counts here too
FLAG_TTS | FLAG_SOUND_GENERATION, // audio output FLAG_TTS | FLAG_SOUND_GENERATION | FLAG_REALTIME_AUDIO, // audio output — and here, so a lone realtime_audio flag still reads as multimodal
FLAG_AUDIO_TRANSFORM, // audio in/out transforms FLAG_AUDIO_TRANSFORM, // audio in/out transforms
FLAG_IMAGE | FLAG_VIDEO, // visual generation FLAG_IMAGE | FLAG_VIDEO, // visual generation
} }
// IsMultimodal returns true if the given usecases span two or more orthogonal // IsMultimodal returns true if the given usecases span two or more orthogonal
@@ -692,6 +700,7 @@ func GetAllModelConfigUsecases() map[string]ModelConfigUsecase {
"FLAG_SPEAKER_RECOGNITION": FLAG_SPEAKER_RECOGNITION, "FLAG_SPEAKER_RECOGNITION": FLAG_SPEAKER_RECOGNITION,
"FLAG_AUDIO_TRANSFORM": FLAG_AUDIO_TRANSFORM, "FLAG_AUDIO_TRANSFORM": FLAG_AUDIO_TRANSFORM,
"FLAG_DIARIZATION": FLAG_DIARIZATION, "FLAG_DIARIZATION": FLAG_DIARIZATION,
"FLAG_REALTIME_AUDIO": FLAG_REALTIME_AUDIO,
} }
} }
@@ -866,6 +875,16 @@ func (c *ModelConfig) GuessUsecases(u ModelConfigUsecase) bool {
} }
} }
if (u & FLAG_REALTIME_AUDIO) == FLAG_REALTIME_AUDIO {
// Backends that own a single any-to-any loop and implement
// AudioToAudioStream — listed here so models without an explicit
// known_usecases still surface on the Talk page.
realtimeAudioBackends := []string{"liquid-audio"}
if !slices.Contains(realtimeAudioBackends, c.Backend) {
return false
}
}
return true return true
} }

84
core/config/mtp.go Normal file
View File

@@ -0,0 +1,84 @@
package config
import (
"strings"
gguf "github.com/gpustack/gguf-parser-go"
"github.com/mudler/xlog"
)
// mtpSpecOptions lists the speculative-decoding option keys auto-applied when
// an MTP head is detected on a llama-cpp GGUF. Defaults track the upstream
// MTP PR (ggml-org/llama.cpp#22673):
//
// - spec_type:draft-mtp activates Multi-Token Prediction
// - spec_n_max:6 draft window
// - spec_p_min:0.75 pinned because upstream marked the 0.75 default
// with a "change to 0.0f" TODO; locking it here keeps acceptance
// thresholds stable across future bumps
var mtpSpecOptions = []string{
"spec_type:draft-mtp",
"spec_n_max:6",
"spec_p_min:0.75",
}
// MTPSpecOptions returns a copy of the option keys auto-applied when an MTP
// head is detected. Exported for testing and for the importer.
func MTPSpecOptions() []string {
out := make([]string, len(mtpSpecOptions))
copy(out, mtpSpecOptions)
return out
}
// HasEmbeddedMTPHead reports whether the parsed GGUF declares a Multi-Token
// Prediction head. Detection reads `<arch>.nextn_predict_layers`, which is
// what `gguf_writer.add_nextn_predict_layers(n)` emits in upstream's
// `conversion/qwen.py` MTP mixin. A positive layer count means the head is
// present in the same GGUF as the trunk.
func HasEmbeddedMTPHead(f *gguf.GGUFFile) (uint32, bool) {
if f == nil {
return 0, false
}
arch := f.Architecture().Architecture
if arch == "" {
return 0, false
}
v, ok := f.Header.MetadataKV.Get(arch + ".nextn_predict_layers")
if !ok {
return 0, false
}
n := gguf.ValueNumeric[uint32](v)
return n, n > 0
}
// hasSpecTypeOption returns true when the slice already contains a
// user-configured `spec_type:` / `speculative_type:` entry. Used to avoid
// clobbering an explicit choice with the MTP auto-defaults.
func hasSpecTypeOption(opts []string) bool {
for _, o := range opts {
if strings.HasPrefix(o, "spec_type:") || strings.HasPrefix(o, "speculative_type:") {
return true
}
}
return false
}
// ApplyMTPDefaults appends the auto-MTP option keys to cfg.Options when none
// is already configured. It is a no-op when the user already picked a
// `spec_type` (either via YAML or via the importer's preferences flow).
//
// `layers` is the value read from `<arch>.nextn_predict_layers` and is only
// used for the diagnostic log line.
func ApplyMTPDefaults(cfg *ModelConfig, layers uint32) {
if cfg == nil {
return
}
if hasSpecTypeOption(cfg.Options) {
xlog.Debug("[mtp] embedded MTP head detected but spec_type already configured; leaving user choice intact",
"name", cfg.Name, "nextn_layers", layers)
return
}
cfg.Options = append(cfg.Options, mtpSpecOptions...)
xlog.Info("[mtp] embedded MTP head detected; enabling draft-mtp speculative decoding",
"name", cfg.Name, "nextn_layers", layers, "spec_n_max", 6, "spec_p_min", 0.75)
}

86
core/config/mtp_test.go Normal file
View File

@@ -0,0 +1,86 @@
package config_test
import (
. "github.com/mudler/LocalAI/core/config"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
var _ = Describe("MTP auto-defaults", func() {
Context("MTPSpecOptions", func() {
It("returns the upstream-recommended speculative tuple", func() {
Expect(MTPSpecOptions()).To(Equal([]string{
"spec_type:draft-mtp",
"spec_n_max:6",
"spec_p_min:0.75",
}))
})
It("returns a defensive copy so callers cannot mutate the package default", func() {
opts := MTPSpecOptions()
opts[0] = "spec_type:none"
Expect(MTPSpecOptions()[0]).To(Equal("spec_type:draft-mtp"))
})
})
Context("ApplyMTPDefaults", func() {
It("appends MTP options when nothing is configured", func() {
cfg := &ModelConfig{Name: "qwen-mtp"}
ApplyMTPDefaults(cfg, 1)
Expect(cfg.Options).To(Equal([]string{
"spec_type:draft-mtp",
"spec_n_max:6",
"spec_p_min:0.75",
}))
})
It("preserves unrelated options already on the config", func() {
cfg := &ModelConfig{
Name: "qwen-mtp",
Options: []string{"use_jinja:true", "cache_reuse:256"},
}
ApplyMTPDefaults(cfg, 1)
Expect(cfg.Options).To(Equal([]string{
"use_jinja:true",
"cache_reuse:256",
"spec_type:draft-mtp",
"spec_n_max:6",
"spec_p_min:0.75",
}))
})
It("is a no-op when the user already configured spec_type", func() {
cfg := &ModelConfig{
Name: "qwen-mtp",
Options: []string{"spec_type:ngram-simple", "use_jinja:true"},
}
ApplyMTPDefaults(cfg, 1)
Expect(cfg.Options).To(Equal([]string{
"spec_type:ngram-simple",
"use_jinja:true",
}))
})
It("also respects the legacy speculative_type alias", func() {
cfg := &ModelConfig{
Name: "qwen-mtp",
Options: []string{"speculative_type:ngram-mod"},
}
ApplyMTPDefaults(cfg, 1)
Expect(cfg.Options).To(Equal([]string{"speculative_type:ngram-mod"}))
})
It("tolerates a nil config", func() {
Expect(func() { ApplyMTPDefaults(nil, 1) }).ToNot(Panic())
})
})
Context("HasEmbeddedMTPHead", func() {
It("returns false on a nil GGUF file", func() {
n, ok := HasEmbeddedMTPHead(nil)
Expect(ok).To(BeFalse())
Expect(n).To(BeZero())
})
})
})

View File

@@ -38,6 +38,7 @@ type RuntimeSettings struct {
Debug *bool `json:"debug,omitempty"` Debug *bool `json:"debug,omitempty"`
EnableTracing *bool `json:"enable_tracing,omitempty"` EnableTracing *bool `json:"enable_tracing,omitempty"`
TracingMaxItems *int `json:"tracing_max_items,omitempty"` TracingMaxItems *int `json:"tracing_max_items,omitempty"`
TracingMaxBodyBytes *int `json:"tracing_max_body_bytes,omitempty"` // Per-body cap in bytes; 0 disables the cap
EnableBackendLogging *bool `json:"enable_backend_logging,omitempty"` EnableBackendLogging *bool `json:"enable_backend_logging,omitempty"`
// Security/CORS settings // Security/CORS settings

View File

@@ -16,6 +16,7 @@ import (
"github.com/mudler/LocalAI/pkg/downloader" "github.com/mudler/LocalAI/pkg/downloader"
"github.com/mudler/LocalAI/pkg/model" "github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/LocalAI/pkg/oci" "github.com/mudler/LocalAI/pkg/oci"
"github.com/mudler/LocalAI/pkg/oci/cosignverify"
"github.com/mudler/LocalAI/pkg/system" "github.com/mudler/LocalAI/pkg/system"
"github.com/mudler/xlog" "github.com/mudler/xlog"
cp "github.com/otiai10/copy" cp "github.com/otiai10/copy"
@@ -102,8 +103,81 @@ func writeBackendMetadata(backendPath string, metadata *BackendMetadata) error {
return nil return nil
} }
// backendDownloadOptions translates the gallery's verification policy into
// downloader options, and gates the call on strict-integrity mode. Both
// InstallBackend and UpgradeBackend MUST route their download through these
// options — without them, the corresponding code path silently downloads
// and activates unverified backend bytes even when the gallery has a
// verification: policy configured.
//
// For OCI URIs with a verification policy, returns a slice containing
// downloader.WithImageVerifier(v) — the downloader will then run cosign
// signature verification between fetching the manifest and extracting
// layers (see pkg/downloader/uri.go OCI branch).
//
// For OCI URIs without a verification policy, or non-OCI URIs without a
// SHA256, the function either returns a non-fatal warning (requireIntegrity
// false) or fails the install (requireIntegrity true).
func backendDownloadOptions(config *GalleryBackend, requireIntegrity bool) ([]downloader.DownloadOption, error) {
uri := downloader.URI(config.URI)
hasVerification := config.Gallery.Verification != nil
hasSHA := config.SHA256 != ""
switch {
case uri.LooksLikeOCI():
if !hasVerification {
if requireIntegrity {
return nil, fmt.Errorf("strict integrity: gallery %q has no verification policy for OCI backend %q (set verification: in the gallery YAML or disable --require-backend-integrity)",
config.Gallery.Name, config.Name)
}
xlog.Warn("installing OCI backend without signature verification",
"backend", config.Name, "gallery", config.Gallery.Name, "uri", config.URI)
return nil, nil
}
v, err := newGalleryVerifier(config.Gallery.Verification)
if err != nil {
return nil, fmt.Errorf("gallery %q verification policy: %w", config.Gallery.Name, err)
}
return []downloader.DownloadOption{downloader.WithImageVerifier(v)}, nil
case uri.LooksLikeDir():
// Local directory — out of scope for integrity checks.
return nil, nil
default:
if !hasSHA && requireIntegrity {
return nil, fmt.Errorf("strict integrity: backend %q has no SHA256 (gallery %q)",
config.Name, config.Gallery.Name)
}
// Non-strict: pkg/downloader already emits a warning when sha is empty.
return nil, nil
}
}
// newGalleryVerifier constructs a cosignverify.Verifier from the gallery
// policy. Parses NotBefore (RFC3339) here so YAML errors surface at install
// time rather than during signature verification.
func newGalleryVerifier(p *config.GalleryVerification) (*cosignverify.Verifier, error) {
pol := cosignverify.Policy{
Issuer: p.Issuer,
IssuerRegex: p.IssuerRegex,
Identity: p.Identity,
IdentityRegex: p.IdentityRegex,
}
if p.NotBefore != "" {
t, err := time.Parse(time.RFC3339, p.NotBefore)
if err != nil {
return nil, fmt.Errorf("not_before %q: %w", p.NotBefore, err)
}
pol.NotBefore = t
}
return cosignverify.NewVerifier(pol, nil, nil)
}
// InstallBackendFromGallery installs a backend from the gallery. // InstallBackendFromGallery installs a backend from the gallery.
func InstallBackendFromGallery(ctx context.Context, galleries []config.Gallery, systemState *system.SystemState, modelLoader *model.ModelLoader, name string, downloadStatus func(string, string, string, float64), force bool) error { // requireIntegrity escalates a missing SHA256 / verification policy from a
// warning to a hard failure (see backendDownloadOptions).
func InstallBackendFromGallery(ctx context.Context, galleries []config.Gallery, systemState *system.SystemState, modelLoader *model.ModelLoader, name string, downloadStatus func(string, string, string, float64), force, requireIntegrity bool) error {
if !force { if !force {
// check if we already have the backend installed // check if we already have the backend installed
backends, err := ListSystemBackends(systemState) backends, err := ListSystemBackends(systemState)
@@ -149,7 +223,7 @@ func InstallBackendFromGallery(ctx context.Context, galleries []config.Gallery,
xlog.Debug("Installing backend from meta backend", "name", name, "bestBackend", bestBackend.Name) xlog.Debug("Installing backend from meta backend", "name", name, "bestBackend", bestBackend.Name)
// Then, let's install the best backend // Then, let's install the best backend
if err := InstallBackend(ctx, systemState, modelLoader, bestBackend, downloadStatus); err != nil { if err := InstallBackend(ctx, systemState, modelLoader, bestBackend, downloadStatus, requireIntegrity); err != nil {
return err return err
} }
@@ -175,10 +249,10 @@ func InstallBackendFromGallery(ctx context.Context, galleries []config.Gallery,
return nil return nil
} }
return InstallBackend(ctx, systemState, modelLoader, backend, downloadStatus) return InstallBackend(ctx, systemState, modelLoader, backend, downloadStatus, requireIntegrity)
} }
func InstallBackend(ctx context.Context, systemState *system.SystemState, modelLoader *model.ModelLoader, config *GalleryBackend, downloadStatus func(string, string, string, float64)) error { func InstallBackend(ctx context.Context, systemState *system.SystemState, modelLoader *model.ModelLoader, config *GalleryBackend, downloadStatus func(string, string, string, float64), requireIntegrity bool) error {
// Get configurable fallback tag values from SystemState // Get configurable fallback tag values from SystemState
latestTag, masterTag, devSuffix := getFallbackTagValues(systemState) latestTag, masterTag, devSuffix := getFallbackTagValues(systemState)
@@ -213,6 +287,14 @@ func InstallBackend(ctx context.Context, systemState *system.SystemState, modelL
return fmt.Errorf("failed to create base path: %v", err) return fmt.Errorf("failed to create base path: %v", err)
} }
// Build the download options once and reuse for every retry path —
// mirrors and tag fallbacks must verify against the same gallery
// policy or we open a hole where a non-default URI bypasses the check.
downloadOpts, optsErr := backendDownloadOptions(config, requireIntegrity)
if optsErr != nil {
return fmt.Errorf("backend %q: %w", config.Name, optsErr)
}
uri := downloader.URI(config.URI) uri := downloader.URI(config.URI)
// Check if it is a directory // Check if it is a directory
if uri.LooksLikeDir() { if uri.LooksLikeDir() {
@@ -222,7 +304,7 @@ func InstallBackend(ctx context.Context, systemState *system.SystemState, modelL
} }
} else { } else {
xlog.Debug("Downloading backend", "uri", config.URI, "backendPath", backendPath) xlog.Debug("Downloading backend", "uri", config.URI, "backendPath", backendPath)
if err := uri.DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus); err != nil { if err := uri.DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus, downloadOpts...); err != nil {
xlog.Debug("Backend download failed, trying fallback", "backendPath", backendPath, "error", err) xlog.Debug("Backend download failed, trying fallback", "backendPath", backendPath, "error", err)
// resetBackendPath cleans up partial state from a failed OCI extraction // resetBackendPath cleans up partial state from a failed OCI extraction
@@ -243,7 +325,7 @@ func InstallBackend(ctx context.Context, systemState *system.SystemState, modelL
default: default:
} }
resetBackendPath() resetBackendPath()
if err := downloader.URI(mirror).DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus); err == nil { if err := downloader.URI(mirror).DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus, downloadOpts...); err == nil {
success = true success = true
xlog.Debug("Downloaded backend from mirror", "uri", config.URI, "backendPath", backendPath) xlog.Debug("Downloaded backend from mirror", "uri", config.URI, "backendPath", backendPath)
break break
@@ -256,7 +338,7 @@ func InstallBackend(ctx context.Context, systemState *system.SystemState, modelL
if fallbackURI != string(config.URI) { if fallbackURI != string(config.URI) {
resetBackendPath() resetBackendPath()
xlog.Info("Trying fallback URI", "original", config.URI, "fallback", fallbackURI) xlog.Info("Trying fallback URI", "original", config.URI, "fallback", fallbackURI)
if err := downloader.URI(fallbackURI).DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus); err == nil { if err := downloader.URI(fallbackURI).DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus, downloadOpts...); err == nil {
xlog.Info("Downloaded backend using fallback URI", "uri", fallbackURI, "backendPath", backendPath) xlog.Info("Downloaded backend using fallback URI", "uri", fallbackURI, "backendPath", backendPath)
success = true success = true
} else { } else {
@@ -265,7 +347,7 @@ func InstallBackend(ctx context.Context, systemState *system.SystemState, modelL
resetBackendPath() resetBackendPath()
devFallbackURI := fallbackURI + "-" + devSuffix devFallbackURI := fallbackURI + "-" + devSuffix
xlog.Info("Trying development fallback URI", "fallback", devFallbackURI) xlog.Info("Trying development fallback URI", "fallback", devFallbackURI)
if err := downloader.URI(devFallbackURI).DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus); err == nil { if err := downloader.URI(devFallbackURI).DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus, downloadOpts...); err == nil {
xlog.Info("Downloaded backend using development fallback URI", "uri", devFallbackURI, "backendPath", backendPath) xlog.Info("Downloaded backend using development fallback URI", "uri", devFallbackURI, "backendPath", backendPath)
success = true success = true
} else { } else {

View File

@@ -117,13 +117,13 @@ var _ = Describe("Gallery Backends", func() {
Describe("InstallBackendFromGallery", func() { Describe("InstallBackendFromGallery", func() {
It("should return error when backend is not found", func() { It("should return error when backend is not found", func() {
err := InstallBackendFromGallery(context.TODO(), galleries, systemState, ml, "non-existent", nil, true) err := InstallBackendFromGallery(context.TODO(), galleries, systemState, ml, "non-existent", nil, true, false)
Expect(err).To(HaveOccurred()) Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("no backend found with name \"non-existent\"")) Expect(err.Error()).To(ContainSubstring("no backend found with name \"non-existent\""))
}) })
It("should install backend from gallery", func() { It("should install backend from gallery", func() {
err := InstallBackendFromGallery(context.TODO(), galleries, systemState, ml, "test-backend", nil, true) err := InstallBackendFromGallery(context.TODO(), galleries, systemState, ml, "test-backend", nil, true, false)
Expect(err).ToNot(HaveOccurred()) Expect(err).ToNot(HaveOccurred())
Expect(filepath.Join(tempDir, "test-backend", "run.sh")).To(BeARegularFile()) Expect(filepath.Join(tempDir, "test-backend", "run.sh")).To(BeARegularFile())
}) })
@@ -545,7 +545,7 @@ var _ = Describe("Gallery Backends", func() {
VRAM: 1000000000000, VRAM: 1000000000000,
Backend: system.Backend{BackendsPath: tempDir}, Backend: system.Backend{BackendsPath: tempDir},
} }
err = InstallBackendFromGallery(context.TODO(), []config.Gallery{gallery}, nvidiaSystemState, ml, "meta-backend", nil, true) err = InstallBackendFromGallery(context.TODO(), []config.Gallery{gallery}, nvidiaSystemState, ml, "meta-backend", nil, true, false)
Expect(err).NotTo(HaveOccurred()) Expect(err).NotTo(HaveOccurred())
metaBackendPath := filepath.Join(tempDir, "meta-backend") metaBackendPath := filepath.Join(tempDir, "meta-backend")
@@ -625,7 +625,7 @@ var _ = Describe("Gallery Backends", func() {
VRAM: 1000000000000, VRAM: 1000000000000,
Backend: system.Backend{BackendsPath: tempDir}, Backend: system.Backend{BackendsPath: tempDir},
} }
err = InstallBackendFromGallery(context.TODO(), []config.Gallery{gallery}, nvidiaSystemState, ml, "meta-backend", nil, true) err = InstallBackendFromGallery(context.TODO(), []config.Gallery{gallery}, nvidiaSystemState, ml, "meta-backend", nil, true, false)
Expect(err).NotTo(HaveOccurred()) Expect(err).NotTo(HaveOccurred())
metaBackendPath := filepath.Join(tempDir, "meta-backend") metaBackendPath := filepath.Join(tempDir, "meta-backend")
@@ -709,7 +709,7 @@ var _ = Describe("Gallery Backends", func() {
VRAM: 1000000000000, VRAM: 1000000000000,
Backend: system.Backend{BackendsPath: tempDir}, Backend: system.Backend{BackendsPath: tempDir},
} }
err = InstallBackendFromGallery(context.TODO(), []config.Gallery{gallery}, nvidiaSystemState, ml, "meta-backend", nil, true) err = InstallBackendFromGallery(context.TODO(), []config.Gallery{gallery}, nvidiaSystemState, ml, "meta-backend", nil, true, false)
Expect(err).NotTo(HaveOccurred()) Expect(err).NotTo(HaveOccurred())
metaBackendPath := filepath.Join(tempDir, "meta-backend") metaBackendPath := filepath.Join(tempDir, "meta-backend")
@@ -808,7 +808,7 @@ var _ = Describe("Gallery Backends", func() {
system.WithBackendPath(newPath), system.WithBackendPath(newPath),
) )
Expect(err).NotTo(HaveOccurred()) Expect(err).NotTo(HaveOccurred())
err = InstallBackend(context.TODO(), systemState, ml, &backend, nil) err = InstallBackend(context.TODO(), systemState, ml, &backend, nil, false)
Expect(newPath).To(BeADirectory()) Expect(newPath).To(BeADirectory())
Expect(err).To(HaveOccurred()) // Will fail due to invalid URI, but path should be created Expect(err).To(HaveOccurred()) // Will fail due to invalid URI, but path should be created
}) })
@@ -840,7 +840,7 @@ var _ = Describe("Gallery Backends", func() {
system.WithBackendPath(tempDir), system.WithBackendPath(tempDir),
) )
Expect(err).NotTo(HaveOccurred()) Expect(err).NotTo(HaveOccurred())
err = InstallBackend(context.TODO(), systemState, ml, &backend, nil) err = InstallBackend(context.TODO(), systemState, ml, &backend, nil, false)
Expect(err).ToNot(HaveOccurred()) Expect(err).ToNot(HaveOccurred())
Expect(filepath.Join(tempDir, "test-backend", "metadata.json")).To(BeARegularFile()) Expect(filepath.Join(tempDir, "test-backend", "metadata.json")).To(BeARegularFile())
dat, err := os.ReadFile(filepath.Join(tempDir, "test-backend", "metadata.json")) dat, err := os.ReadFile(filepath.Join(tempDir, "test-backend", "metadata.json"))
@@ -873,7 +873,7 @@ var _ = Describe("Gallery Backends", func() {
Expect(filepath.Join(tempDir, "test-backend", "metadata.json")).ToNot(BeARegularFile()) Expect(filepath.Join(tempDir, "test-backend", "metadata.json")).ToNot(BeARegularFile())
err = InstallBackend(context.TODO(), systemState, ml, &backend, nil) err = InstallBackend(context.TODO(), systemState, ml, &backend, nil, false)
Expect(err).ToNot(HaveOccurred()) Expect(err).ToNot(HaveOccurred())
Expect(filepath.Join(tempDir, "test-backend", "metadata.json")).To(BeARegularFile()) Expect(filepath.Join(tempDir, "test-backend", "metadata.json")).To(BeARegularFile())
}) })
@@ -894,7 +894,7 @@ var _ = Describe("Gallery Backends", func() {
system.WithBackendPath(tempDir), system.WithBackendPath(tempDir),
) )
Expect(err).NotTo(HaveOccurred()) Expect(err).NotTo(HaveOccurred())
err = InstallBackend(context.TODO(), systemState, ml, &backend, nil) err = InstallBackend(context.TODO(), systemState, ml, &backend, nil, false)
Expect(err).ToNot(HaveOccurred()) Expect(err).ToNot(HaveOccurred())
Expect(filepath.Join(tempDir, "test-backend", "metadata.json")).To(BeARegularFile()) Expect(filepath.Join(tempDir, "test-backend", "metadata.json")).To(BeARegularFile())

View File

@@ -47,7 +47,7 @@ var _ = Describe("Backend versioning", func() {
backend.URI = srcDir backend.URI = srcDir
backend.Version = "1.2.3" backend.Version = "1.2.3"
err = gallery.InstallBackend(context.Background(), systemState, modelLoader, backend, nil) err = gallery.InstallBackend(context.Background(), systemState, modelLoader, backend, nil, false)
Expect(err).NotTo(HaveOccurred()) Expect(err).NotTo(HaveOccurred())
// Read the metadata file and check version // Read the metadata file and check version
@@ -74,7 +74,7 @@ var _ = Describe("Backend versioning", func() {
backend.URI = srcDir backend.URI = srcDir
backend.Version = "2.0.0" backend.Version = "2.0.0"
err = gallery.InstallBackend(context.Background(), systemState, modelLoader, backend, nil) err = gallery.InstallBackend(context.Background(), systemState, modelLoader, backend, nil, false)
Expect(err).NotTo(HaveOccurred()) Expect(err).NotTo(HaveOccurred())
metadataPath := filepath.Join(tempDir, "test-backend-uri", "metadata.json") metadataPath := filepath.Join(tempDir, "test-backend-uri", "metadata.json")
@@ -100,7 +100,7 @@ var _ = Describe("Backend versioning", func() {
backend.URI = srcDir backend.URI = srcDir
// Version intentionally left empty // Version intentionally left empty
err = gallery.InstallBackend(context.Background(), systemState, modelLoader, backend, nil) err = gallery.InstallBackend(context.Background(), systemState, modelLoader, backend, nil, false)
Expect(err).NotTo(HaveOccurred()) Expect(err).NotTo(HaveOccurred())
metadataPath := filepath.Join(tempDir, "test-backend-noversion", "metadata.json") metadataPath := filepath.Join(tempDir, "test-backend-noversion", "metadata.json")

View File

@@ -130,6 +130,8 @@ var defaultImporters = []Importer{
// and would otherwise swallow the C++ port's GGUF bundles. // and would otherwise swallow the C++ port's GGUF bundles.
&VibeVoiceCppImporter{}, &VibeVoiceCppImporter{},
&VibeVoiceImporter{}, &VibeVoiceImporter{},
// LiquidAudio (Python) — keep before LlamaCPP so non-GGUF LFM2-Audio repos route here.
&LiquidAudioImporter{},
&CoquiImporter{}, &CoquiImporter{},
// Image/Video (Batch 3) // Image/Video (Batch 3)
&StableDiffusionGGMLImporter{}, &StableDiffusionGGMLImporter{},

View File

@@ -0,0 +1,145 @@
package importers
import (
"encoding/json"
"path/filepath"
"strings"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/gallery"
"github.com/mudler/LocalAI/core/schema"
"go.yaml.in/yaml/v2"
)
var _ Importer = &LiquidAudioImporter{}
// LiquidAudioImporter recognises LiquidAI's LFM2-Audio family (LFM2-Audio-1.5B,
// LFM2.5-Audio-1.5B, community finetunes) and routes them to the Python
// `liquid-audio` backend. Detection is by repo-name substring so third-party
// mirrors still match. preferences.backend="liquid-audio" overrides detection.
//
// Once upstream llama.cpp PR #18641 lands and the GGUF gallery entries are
// added, GGUF mirrors of these models should route to llama-cpp; that's
// handled by ordering LlamaCPPImporter after this one and by the explicit
// "-gguf" exclusion below.
type LiquidAudioImporter struct{}
func (i *LiquidAudioImporter) Name() string { return "liquid-audio" }
func (i *LiquidAudioImporter) Modality() string { return "tts" }
func (i *LiquidAudioImporter) AutoDetects() bool { return true }
func (i *LiquidAudioImporter) Match(details Details) bool {
preferences, err := details.Preferences.MarshalJSON()
if err != nil {
return false
}
preferencesMap := make(map[string]any)
if len(preferences) > 0 {
if err := json.Unmarshal(preferences, &preferencesMap); err != nil {
return false
}
}
if b, ok := preferencesMap["backend"].(string); ok && b == "liquid-audio" {
return true
}
matchRepo := func(repo string) bool {
r := strings.ToLower(repo)
// Cede GGUF mirrors to the (later-ordered) llama-cpp importer.
if strings.HasSuffix(r, "-gguf") {
return false
}
return strings.Contains(r, "lfm2-audio") || strings.Contains(r, "lfm2.5-audio")
}
if details.HuggingFace != nil {
repoName := details.HuggingFace.ModelID
if idx := strings.Index(repoName, "/"); idx >= 0 {
repoName = repoName[idx+1:]
}
if matchRepo(repoName) {
return true
}
}
if _, repo, ok := HFOwnerRepoFromURI(details.URI); ok {
return matchRepo(repo)
}
return false
}
func (i *LiquidAudioImporter) Import(details Details) (gallery.ModelConfig, error) {
preferences, err := details.Preferences.MarshalJSON()
if err != nil {
return gallery.ModelConfig{}, err
}
preferencesMap := make(map[string]any)
if len(preferences) > 0 {
if err := json.Unmarshal(preferences, &preferencesMap); err != nil {
return gallery.ModelConfig{}, err
}
}
name, ok := preferencesMap["name"].(string)
if !ok {
name = filepath.Base(details.URI)
}
description, ok := preferencesMap["description"].(string)
if !ok {
description = "Imported from " + details.URI
}
model := details.URI
if details.HuggingFace != nil && details.HuggingFace.ModelID != "" {
model = details.HuggingFace.ModelID
}
// Preferences may pin the mode (chat / asr / tts / s2s / finetune).
// Default to s2s — the headline any-to-any use case.
mode, _ := preferencesMap["mode"].(string)
if mode == "" {
mode = "s2s"
}
options := []string{"mode:" + mode}
if voice, ok := preferencesMap["voice"].(string); ok && voice != "" {
options = append(options, "voice:"+voice)
}
usecases := []string{"chat"}
switch mode {
case "asr":
usecases = []string{"transcript"}
case "tts":
usecases = []string{"tts"}
case "s2s":
// realtime_audio surfaces the model on the Talk page; chat/tts/
// transcript/vad keep the standalone OpenAI-compatible endpoints
// working since liquid-audio implements all of them.
usecases = []string{"realtime_audio", "chat", "tts", "transcript", "vad"}
}
modelConfig := config.ModelConfig{
Name: name,
Description: description,
Backend: "liquid-audio",
KnownUsecaseStrings: usecases,
Options: options,
PredictionOptions: schema.PredictionOptions{
BasicModelRequest: schema.BasicModelRequest{Model: model},
},
}
data, err := yaml.Marshal(modelConfig)
if err != nil {
return gallery.ModelConfig{}, err
}
return gallery.ModelConfig{
Name: name,
Description: description,
ConfigFile: string(data),
}, nil
}

View File

@@ -0,0 +1,91 @@
package importers_test
import (
"encoding/json"
"fmt"
"github.com/mudler/LocalAI/core/gallery/importers"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
var _ = Describe("LiquidAudioImporter", func() {
Context("detection from HuggingFace", func() {
It("matches LiquidAI/LFM2.5-Audio-1.5B", func() {
uri := "https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B"
preferences := json.RawMessage(`{}`)
modelConfig, err := importers.DiscoverModelConfig(uri, preferences)
Expect(err).ToNot(HaveOccurred(), fmt.Sprintf("Error: %v", err))
Expect(modelConfig.ConfigFile).To(ContainSubstring("backend: liquid-audio"))
Expect(modelConfig.ConfigFile).To(ContainSubstring("LiquidAI/LFM2.5-Audio-1.5B"))
})
It("matches LiquidAI/LFM2-Audio-1.5B (older variant)", func() {
uri := "https://huggingface.co/LiquidAI/LFM2-Audio-1.5B"
preferences := json.RawMessage(`{}`)
modelConfig, err := importers.DiscoverModelConfig(uri, preferences)
Expect(err).ToNot(HaveOccurred(), fmt.Sprintf("Error: %v", err))
Expect(modelConfig.ConfigFile).To(ContainSubstring("backend: liquid-audio"))
})
It("cedes -GGUF mirrors to the llama-cpp importer", func() {
// LiquidAI/LFM2.5-Audio-1.5B-GGUF should NOT route to liquid-audio.
// Once upstream PR #18641 lands and the GGUF gallery entry exists,
// this is the path that lets users opt into the C++ runtime.
uri := "https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B-GGUF"
preferences := json.RawMessage(`{}`)
modelConfig, err := importers.DiscoverModelConfig(uri, preferences)
Expect(err).ToNot(HaveOccurred(), fmt.Sprintf("Error: %v", err))
Expect(modelConfig.ConfigFile).ToNot(ContainSubstring("backend: liquid-audio"),
fmt.Sprintf("GGUF repo should not match Python importer; got: %s", modelConfig.ConfigFile))
})
})
Context("preference override", func() {
It("honours preferences.backend=liquid-audio for arbitrary URIs", func() {
uri := "https://example.com/some-unrelated-model"
preferences := json.RawMessage(`{"backend": "liquid-audio"}`)
modelConfig, err := importers.DiscoverModelConfig(uri, preferences)
Expect(err).ToNot(HaveOccurred(), fmt.Sprintf("Error: %v", err))
Expect(modelConfig.ConfigFile).To(ContainSubstring("backend: liquid-audio"))
})
It("picks up the mode preference", func() {
uri := "https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B"
preferences := json.RawMessage(`{"mode": "asr"}`)
modelConfig, err := importers.DiscoverModelConfig(uri, preferences)
Expect(err).ToNot(HaveOccurred(), fmt.Sprintf("Error: %v", err))
Expect(modelConfig.ConfigFile).To(ContainSubstring("mode:asr"))
Expect(modelConfig.ConfigFile).To(ContainSubstring("transcript"))
})
It("picks up the voice preference", func() {
uri := "https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B"
preferences := json.RawMessage(`{"mode": "tts", "voice": "uk_male"}`)
modelConfig, err := importers.DiscoverModelConfig(uri, preferences)
Expect(err).ToNot(HaveOccurred(), fmt.Sprintf("Error: %v", err))
Expect(modelConfig.ConfigFile).To(ContainSubstring("voice:uk_male"))
})
})
Context("Importer interface metadata", func() {
It("exposes name/modality/autodetect", func() {
imp := &importers.LiquidAudioImporter{}
Expect(imp.Name()).To(Equal("liquid-audio"))
Expect(imp.Modality()).To(Equal("tts"))
Expect(imp.AutoDetects()).To(BeTrue())
})
})
})

View File

@@ -1,10 +1,13 @@
package importers package importers
import ( import (
"context"
"encoding/json" "encoding/json"
"path/filepath" "path/filepath"
"strings" "strings"
"time"
gguf "github.com/gpustack/gguf-parser-go"
"github.com/mudler/LocalAI/core/config" "github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/gallery" "github.com/mudler/LocalAI/core/gallery"
"github.com/mudler/LocalAI/core/schema" "github.com/mudler/LocalAI/core/schema"
@@ -261,6 +264,13 @@ func (i *LlamaCPPImporter) Import(details Details) (gallery.ModelConfig, error)
// Apply per-model-family inference parameter defaults // Apply per-model-family inference parameter defaults
config.ApplyInferenceDefaults(&modelConfig, details.URI) config.ApplyInferenceDefaults(&modelConfig, details.URI)
// Auto-detect Multi-Token Prediction heads (ggml-org/llama.cpp#22673) and
// enable speculative decoding. Mirrors the load-time hook so freshly
// imported configs already carry spec_type:draft-mtp before the model is
// ever loaded - users see it in the YAML preview rather than discovering
// it after the first start.
maybeApplyMTPDefaults(&modelConfig, details, &cfg)
data, err := yaml.Marshal(modelConfig) data, err := yaml.Marshal(modelConfig)
if err != nil { if err != nil {
return gallery.ModelConfig{}, err return gallery.ModelConfig{}, err
@@ -291,6 +301,85 @@ func pickPreferredGroup(groups []hfapi.ShardGroup, prefs []string) *hfapi.ShardG
return &groups[len(groups)-1] return &groups[len(groups)-1]
} }
// maybeApplyMTPDefaults parses the picked GGUF header (range-fetched over
// HTTP for HF/URL imports) and, if the file declares a Multi-Token Prediction
// head, appends the auto-MTP option keys to modelConfig.Options. Failures
// during the probe are non-fatal: the importer keeps the config without MTP
// so an unrelated network blip or weird header doesn't break the import.
//
// OCI/Ollama URIs are skipped because the artifact isn't directly fetchable
// as a GGUF byte stream - the load-time hook (core/config/gguf.go) covers
// those once the model is materialised on disk.
func maybeApplyMTPDefaults(modelConfig *config.ModelConfig, details Details, cfg *gallery.ModelConfig) {
probeURL := pickMTPProbeURL(details, cfg)
if probeURL == "" {
return
}
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
defer func() {
if r := recover(); r != nil {
xlog.Debug("[mtp-importer] panic while probing GGUF header", "uri", probeURL, "recover", r)
}
}()
f, err := gguf.ParseGGUFFileRemote(ctx, probeURL)
if err != nil {
xlog.Debug("[mtp-importer] failed to read remote GGUF header for MTP detection", "uri", probeURL, "error", err)
return
}
n, ok := config.HasEmbeddedMTPHead(f)
if !ok {
return
}
config.ApplyMTPDefaults(modelConfig, n)
}
// pickMTPProbeURL returns an HTTP(S) URL pointing at the main (non-mmproj)
// GGUF shard that should be inspected for an MTP head, or "" when no
// suitable URL is available. Custom URI schemes (`huggingface://`,
// `ollama://`, etc.) are run through `downloader.URI.ResolveURL` so the
// resulting URL is something `gguf.ParseGGUFFileRemote` can actually open.
// OCI/Ollama URIs are skipped because the artifact is not directly
// streamable as a GGUF byte range.
func pickMTPProbeURL(details Details, cfg *gallery.ModelConfig) string {
uri := downloader.URI(details.URI)
if uri.LooksLikeOCI() {
return ""
}
if strings.HasSuffix(strings.ToLower(details.URI), ".gguf") {
return resolveHTTPProbe(details.URI)
}
for _, f := range cfg.Files {
lower := strings.ToLower(f.Filename)
if strings.Contains(lower, "mmproj") {
continue
}
if !strings.HasSuffix(lower, ".gguf") {
continue
}
return resolveHTTPProbe(f.URI)
}
return ""
}
// resolveHTTPProbe resolves an importer-side URI to the HTTP(S) URL that
// `gguf.ParseGGUFFileRemote` can range-fetch. Returns "" if the URI can't
// be reduced to an HTTP(S) endpoint (e.g. local path, unsupported scheme).
func resolveHTTPProbe(uri string) string {
resolved := downloader.URI(uri).ResolveURL()
if downloader.URI(resolved).LooksLikeHTTPURL() {
return resolved
}
return ""
}
// appendShardGroup copies every shard of group into cfg.Files under dest, // appendShardGroup copies every shard of group into cfg.Files under dest,
// skipping any entry whose target filename is already present so repeated // skipping any entry whose target filename is already present so repeated
// calls (e.g. the rare case of mmproj + model picking the same group) // calls (e.g. the rare case of mmproj + model picking the same group)

View File

@@ -77,7 +77,7 @@ func InstallModelFromGallery(
modelGalleries, backendGalleries []lconfig.Gallery, modelGalleries, backendGalleries []lconfig.Gallery,
systemState *system.SystemState, systemState *system.SystemState,
modelLoader *model.ModelLoader, modelLoader *model.ModelLoader,
name string, req GalleryModel, downloadStatus func(string, string, string, float64), enforceScan, automaticallyInstallBackend bool) error { name string, req GalleryModel, downloadStatus func(string, string, string, float64), enforceScan, automaticallyInstallBackend, requireBackendIntegrity bool) error {
applyModel := func(model *GalleryModel) error { applyModel := func(model *GalleryModel) error {
name = strings.ReplaceAll(name, string(os.PathSeparator), "__") name = strings.ReplaceAll(name, string(os.PathSeparator), "__")
@@ -137,7 +137,7 @@ func InstallModelFromGallery(
if automaticallyInstallBackend && installedModel.Backend != "" { if automaticallyInstallBackend && installedModel.Backend != "" {
xlog.Debug("Installing backend", "backend", installedModel.Backend) xlog.Debug("Installing backend", "backend", installedModel.Backend)
if err := InstallBackendFromGallery(ctx, backendGalleries, systemState, modelLoader, installedModel.Backend, downloadStatus, false); err != nil { if err := InstallBackendFromGallery(ctx, backendGalleries, systemState, modelLoader, installedModel.Backend, downloadStatus, false, requireBackendIntegrity); err != nil {
return err return err
} }
} }

View File

@@ -89,7 +89,7 @@ var _ = Describe("Model test", func() {
Expect(models[0].URL).To(Equal(bertEmbeddingsURL)) Expect(models[0].URL).To(Equal(bertEmbeddingsURL))
Expect(models[0].Installed).To(BeFalse()) Expect(models[0].Installed).To(BeFalse())
err = InstallModelFromGallery(context.TODO(), galleries, []config.Gallery{}, systemState, nil, "test@bert", GalleryModel{}, func(s1, s2, s3 string, f float64) {}, true, true) err = InstallModelFromGallery(context.TODO(), galleries, []config.Gallery{}, systemState, nil, "test@bert", GalleryModel{}, func(s1, s2, s3 string, f float64) {}, true, true, false)
Expect(err).ToNot(HaveOccurred()) Expect(err).ToNot(HaveOccurred())
dat, err := os.ReadFile(filepath.Join(tempdir, "bert.yaml")) dat, err := os.ReadFile(filepath.Join(tempdir, "bert.yaml"))

View File

@@ -232,7 +232,7 @@ func summarizeNodeDrift(nodes []NodeBackendRef) (majority struct{ version, diges
// UpgradeBackend upgrades a single backend to the latest gallery version using // UpgradeBackend upgrades a single backend to the latest gallery version using
// an atomic swap with backup-based rollback on failure. // an atomic swap with backup-based rollback on failure.
func UpgradeBackend(ctx context.Context, systemState *system.SystemState, modelLoader *model.ModelLoader, galleries []config.Gallery, backendName string, downloadStatus func(string, string, string, float64)) error { func UpgradeBackend(ctx context.Context, systemState *system.SystemState, modelLoader *model.ModelLoader, galleries []config.Gallery, backendName string, downloadStatus func(string, string, string, float64), requireIntegrity bool) error {
// Look up the installed backend // Look up the installed backend
installedBackends, err := ListSystemBackends(systemState) installedBackends, err := ListSystemBackends(systemState)
if err != nil { if err != nil {
@@ -251,7 +251,7 @@ func UpgradeBackend(ctx context.Context, systemState *system.SystemState, modelL
// If this is a meta backend, recursively upgrade the concrete backend it points to // If this is a meta backend, recursively upgrade the concrete backend it points to
if installed.Metadata != nil && installed.Metadata.MetaBackendFor != "" { if installed.Metadata != nil && installed.Metadata.MetaBackendFor != "" {
xlog.Info("Meta backend detected, upgrading concrete backend", "meta", backendName, "concrete", installed.Metadata.MetaBackendFor) xlog.Info("Meta backend detected, upgrading concrete backend", "meta", backendName, "concrete", installed.Metadata.MetaBackendFor)
return UpgradeBackend(ctx, systemState, modelLoader, galleries, installed.Metadata.MetaBackendFor, downloadStatus) return UpgradeBackend(ctx, systemState, modelLoader, galleries, installed.Metadata.MetaBackendFor, downloadStatus, requireIntegrity)
} }
// Find the gallery entry // Find the gallery entry
@@ -265,6 +265,16 @@ func UpgradeBackend(ctx context.Context, systemState *system.SystemState, modelL
return fmt.Errorf("no gallery entry found for backend %q", backendName) return fmt.Errorf("no gallery entry found for backend %q", backendName)
} }
// Resolve integrity options (cosign verifier for OCI URIs, strict-mode
// gate for missing SHA256/policy) BEFORE writing anything to disk.
// Without this, the upgrade path would atomically swap in an
// unverified backend even when the gallery has a verification policy
// — see backendDownloadOptions in backends.go.
downloadOpts, err := backendDownloadOptions(galleryEntry, requireIntegrity)
if err != nil {
return fmt.Errorf("upgrade %q: %w", backendName, err)
}
backendPath := filepath.Join(systemState.Backend.BackendsPath, backendName) backendPath := filepath.Join(systemState.Backend.BackendsPath, backendName)
tmpPath := backendPath + ".upgrade-tmp" tmpPath := backendPath + ".upgrade-tmp"
backupPath := backendPath + ".backup" backupPath := backendPath + ".backup"
@@ -285,7 +295,7 @@ func UpgradeBackend(ctx context.Context, systemState *system.SystemState, modelL
return fmt.Errorf("failed to copy backend from directory: %w", err) return fmt.Errorf("failed to copy backend from directory: %w", err)
} }
} else { } else {
if err := uri.DownloadFileWithContext(ctx, tmpPath, "", 1, 1, downloadStatus); err != nil { if err := uri.DownloadFileWithContext(ctx, tmpPath, galleryEntry.SHA256, 1, 1, downloadStatus, downloadOpts...); err != nil {
os.RemoveAll(tmpPath) os.RemoveAll(tmpPath)
return fmt.Errorf("failed to download backend: %w", err) return fmt.Errorf("failed to download backend: %w", err)
} }

View File

@@ -383,7 +383,7 @@ var _ = Describe("Upgrade Detection and Execution", func() {
}) })
ml := model.NewModelLoader(systemState) ml := model.NewModelLoader(systemState)
err := UpgradeBackend(context.Background(), systemState, ml, galleries, "my-backend", nil) err := UpgradeBackend(context.Background(), systemState, ml, galleries, "my-backend", nil, false)
Expect(err).NotTo(HaveOccurred()) Expect(err).NotTo(HaveOccurred())
// Verify run.sh was updated // Verify run.sh was updated
@@ -417,7 +417,7 @@ var _ = Describe("Upgrade Detection and Execution", func() {
}) })
ml := model.NewModelLoader(systemState) ml := model.NewModelLoader(systemState)
err := UpgradeBackend(context.Background(), systemState, ml, galleries, "my-backend", nil) err := UpgradeBackend(context.Background(), systemState, ml, galleries, "my-backend", nil, false)
Expect(err).To(HaveOccurred()) Expect(err).To(HaveOccurred())
// Verify v1 is still intact // Verify v1 is still intact
@@ -432,5 +432,41 @@ var _ = Describe("Upgrade Detection and Execution", func() {
Expect(json.Unmarshal(metaData, &meta)).To(Succeed()) Expect(json.Unmarshal(metaData, &meta)).To(Succeed())
Expect(meta.Version).To(Equal("1.0.0")) Expect(meta.Version).To(Equal("1.0.0"))
}) })
// Regression: an earlier version of UpgradeBackend wrote the
// downloaded bytes to disk without going through
// backendDownloadOptions, so the gallery's verification policy
// (and strict-integrity gate) didn't apply on upgrade. This test
// pins the upgrade path to the same integrity gate as installs:
// strict mode + an OCI URI without a verification: block must
// hard-fail *before* anything is downloaded or swapped in.
It("should refuse to upgrade an OCI backend that bypasses integrity in strict mode", func() {
installBackendWithVersion("my-backend", "1.0.0", "#!/bin/sh\necho v1")
// OCI URI, no Gallery.Verification → backendDownloadOptions
// returns a strict-integrity error before any network call.
writeGalleryYAML([]GalleryBackend{
{
Metadata: Metadata{
Name: "my-backend",
},
URI: "oci://example.invalid/missing:never-fetched",
Version: "2.0.0",
},
})
ml := model.NewModelLoader(systemState)
err := UpgradeBackend(context.Background(), systemState, ml, galleries, "my-backend", nil, true)
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("strict integrity"))
// The installed v1 must be untouched — the upgrade should
// have aborted before writing anything.
content, err := os.ReadFile(filepath.Join(backendsPath, "my-backend", "run.sh"))
Expect(err).NotTo(HaveOccurred())
Expect(string(content)).To(Equal("#!/bin/sh\necho v1"))
Expect(filepath.Join(backendsPath, "my-backend.upgrade-tmp")).NotTo(BeAnExistingFile())
Expect(filepath.Join(backendsPath, "my-backend.backup")).NotTo(BeAnExistingFile())
})
}) })
}) })

View File

@@ -28,6 +28,7 @@ import (
"github.com/mudler/LocalAI/core/services/monitoring" "github.com/mudler/LocalAI/core/services/monitoring"
"github.com/mudler/LocalAI/core/services/nodes" "github.com/mudler/LocalAI/core/services/nodes"
"github.com/mudler/LocalAI/core/services/quantization" "github.com/mudler/LocalAI/core/services/quantization"
"github.com/mudler/LocalAI/pkg/signals"
"github.com/mudler/xlog" "github.com/mudler/xlog"
) )
@@ -267,9 +268,12 @@ func API(application *application.Application) (*echo.Echo, error) {
e.Static("/generated-videos", videoPath) e.Static("/generated-videos", videoPath)
} }
// Initialize usage recording when auth DB is available // Initialize usage recording when auth DB is available, and ensure the
// batcher drains its in-memory queue on graceful shutdown so the last
// few seconds of usage don't disappear when the process exits.
if application.AuthDB() != nil { if application.AuthDB() != nil {
httpMiddleware.InitUsageRecorder(application.AuthDB()) httpMiddleware.InitUsageRecorder(application.AuthDB())
signals.RegisterGracefulTerminationHandler(httpMiddleware.ShutdownUsageRecorder)
} }
// Auth is applied to _all_ endpoints. Filtering out endpoints to bypass is // Auth is applied to _all_ endpoints. Filtering out endpoints to bypass is
@@ -403,7 +407,7 @@ func API(application *application.Application) (*echo.Echo, error) {
} }
} }
routes.RegisterNodeSelfServiceRoutes(e, registry, distCfg.RegistrationToken, distCfg.AutoApproveNodes, application.AuthDB(), application.ApplicationConfig().Auth.APIKeyHMACSecret) routes.RegisterNodeSelfServiceRoutes(e, registry, distCfg.RegistrationToken, distCfg.AutoApproveNodes, application.AuthDB(), application.ApplicationConfig().Auth.APIKeyHMACSecret)
routes.RegisterNodeAdminRoutes(e, registry, remoteUnloader, adminMiddleware, application.AuthDB(), application.ApplicationConfig().Auth.APIKeyHMACSecret, application.ApplicationConfig().Distributed.RegistrationToken) routes.RegisterNodeAdminRoutes(e, registry, remoteUnloader, application.GalleryService(), opcache, application.ApplicationConfig(), adminMiddleware, application.AuthDB(), application.ApplicationConfig().Auth.APIKeyHMACSecret, application.ApplicationConfig().Distributed.RegistrationToken)
// Distributed SSE routes (job progress + agent events via NATS) // Distributed SSE routes (job progress + agent events via NATS)
if d := application.Distributed(); d != nil { if d := application.Distributed(); d != nil {
@@ -443,6 +447,25 @@ func API(application *application.Application) (*echo.Echo, error) {
baseTag := `<base href="` + httpMiddleware.SecureBaseHref(baseURL) + `" />` baseTag := `<base href="` + httpMiddleware.SecureBaseHref(baseURL) + `" />`
indexHTML = []byte(strings.Replace(string(indexHTML), "<head>", "<head>\n "+baseTag, 1)) indexHTML = []byte(strings.Replace(string(indexHTML), "<head>", "<head>\n "+baseTag, 1))
} }
// <base href> only changes how relative URLs resolve; path-absolute
// URLs (those starting with `/`) still resolve against the origin
// and would bypass the reverse-proxy prefix. Rewrite the internal
// path-absolute references emitted by the build so the browser
// requests them through the proxy under the prefix.
//
// HTML-escape the prefix before interpolating it into attributes:
// BasePathPrefix already gates X-Forwarded-Prefix via
// SafeForwardedPrefix, but the validator only blocks open-redirect
// shapes (// prefix, backslashes, control chars), not attribute
// breakout characters like `"`. Escaping makes this resilient
// even if the validator ever loosens.
if prefix := httpMiddleware.BasePathPrefix(c); prefix != "/" {
safePrefix := httpMiddleware.SecureBaseHref(prefix)
html := string(indexHTML)
html = strings.ReplaceAll(html, `="/assets/`, `="`+safePrefix+`assets/`)
html = strings.ReplaceAll(html, `="/favicon.svg"`, `="`+safePrefix+`favicon.svg"`)
indexHTML = []byte(html)
}
return c.HTMLBlob(http.StatusOK, indexHTML) return c.HTMLBlob(http.StatusOK, indexHTML)
} }

View File

@@ -446,6 +446,42 @@ var _ = Describe("API test", func() {
Expect(sc).To(Equal(200), "status code") Expect(sc).To(Equal(200), "status code")
Expect(string(body)).To(ContainSubstring(`<base href="https://example.org/myprefix/" />`), "body") Expect(string(body)).To(ContainSubstring(`<base href="https://example.org/myprefix/" />`), "body")
}) })
// Caddy's `handle_path` (and similar directives) strip the matched
// prefix before forwarding upstream, so LocalAI receives the
// already-stripped path together with X-Forwarded-Prefix. The base
// href and asset URLs must still include the prefix so the browser
// requests them through the proxy.
It("Should support reverse-proxy when prefix is stripped by the proxy", func() {
err, sc, body := getRequest("http://127.0.0.1:9090/app", http.Header{
"X-Forwarded-Proto": {"https"},
"X-Forwarded-Host": {"example.org"},
"X-Forwarded-Prefix": {"/myprefix"},
})
Expect(err).To(BeNil(), "error")
Expect(sc).To(Equal(200), "status code")
Expect(string(body)).To(ContainSubstring(`<base href="https://example.org/myprefix/" />`), "body")
Expect(string(body)).ToNot(ContainSubstring(`="/assets/`), "asset URLs must include the prefix")
Expect(string(body)).ToNot(ContainSubstring(`="/favicon.svg"`), "favicon URL must include the prefix")
})
// X-Forwarded-Prefix is attacker controllable on misconfigured
// proxy chains. A value like "//evil.com" would otherwise turn the
// asset URL rewrite into a protocol-relative URL that loads JS
// from a foreign origin. BasePathPrefix must reject these via
// SafeForwardedPrefix and fall back to "/".
It("Should ignore an unsafe X-Forwarded-Prefix and not poison asset URLs", func() {
err, sc, body := getRequest("http://127.0.0.1:9090/app", http.Header{
"X-Forwarded-Proto": {"https"},
"X-Forwarded-Host": {"example.org"},
"X-Forwarded-Prefix": {"//evil.com"},
})
Expect(err).To(BeNil(), "error")
Expect(sc).To(Equal(200), "status code")
Expect(string(body)).ToNot(ContainSubstring("evil.com"), "unsafe prefix must not leak into the response")
Expect(string(body)).ToNot(ContainSubstring(`="//`), "asset URLs must not become protocol-relative")
})
}) })
Context("Applying models", func() { Context("Applying models", func() {

View File

@@ -38,9 +38,15 @@ func InitDB(databaseURL string) (*gorm.DB, error) {
} }
// Backfill: users created before the provider column existed have an empty // Backfill: users created before the provider column existed have an empty
// provider treat them as local accounts so the UI can identify them. // provider - treat them as local accounts so the UI can identify them.
db.Exec("UPDATE users SET provider = ? WHERE provider = '' OR provider IS NULL", ProviderLocal) db.Exec("UPDATE users SET provider = ? WHERE provider = '' OR provider IS NULL", ProviderLocal)
// Backfill: pre-feature usage_records have no source column. Classify them so the
// new per-source aggregators include them.
if err := BackfillUsageSource(db); err != nil {
return nil, fmt.Errorf("failed to backfill usage source: %w", err)
}
// Create composite index on users(provider, subject) for fast OAuth lookups // Create composite index on users(provider, subject) for fast OAuth lookups
if err := db.Exec("CREATE INDEX IF NOT EXISTS idx_users_provider_subject ON users(provider, subject)").Error; err != nil { if err := db.Exec("CREATE INDEX IF NOT EXISTS idx_users_provider_subject ON users(provider, subject)").Error; err != nil {
// Ignore error on postgres if index already exists // Ignore error on postgres if index already exists

View File

@@ -16,8 +16,10 @@ import (
) )
const ( const (
contextKeyUser = "auth_user" contextKeyUser = "auth_user"
contextKeyRole = "auth_role" contextKeyRole = "auth_role"
contextKeyAPIKey = "auth_apikey"
contextKeySource = "auth_source"
) )
// Middleware returns an Echo middleware that handles authentication. // Middleware returns an Echo middleware that handles authentication.
@@ -75,6 +77,7 @@ func Middleware(db *gorm.DB, appConfig *config.ApplicationConfig) echo.Middlewar
} }
c.Set(contextKeyUser, syntheticUser) c.Set(contextKeyUser, syntheticUser)
c.Set(contextKeyRole, RoleAdmin) c.Set(contextKeyRole, RoleAdmin)
c.Set(contextKeySource, UsageSourceLegacy)
authenticated = true authenticated = true
} }
} }
@@ -213,6 +216,20 @@ func GetUserRole(c echo.Context) string {
return role return role
} }
// GetAPIKey returns the resolved API key from the echo context, or nil.
// Nil for session-cookie and legacy-env-key authentication.
func GetAPIKey(c echo.Context) *UserAPIKey {
k, _ := c.Get(contextKeyAPIKey).(*UserAPIKey)
return k
}
// GetSource returns the request's authentication source: UsageSourceAPIKey,
// UsageSourceWeb, UsageSourceLegacy, or empty if no authentication was performed.
func GetSource(c echo.Context) string {
s, _ := c.Get(contextKeySource).(string)
return s
}
// RequireRouteFeature returns a global middleware that checks the user has access // RequireRouteFeature returns a global middleware that checks the user has access
// to the feature required by the matched route. It uses the RouteFeatureRegistry // to the feature required by the matched route. It uses the RouteFeatureRegistry
// to look up the required feature for each route pattern + HTTP method. // to look up the required feature for each route pattern + HTTP method.
@@ -421,47 +438,67 @@ func RequireQuota(db *gorm.DB) echo.MiddlewareFunc {
} }
// tryAuthenticate attempts to authenticate the request using the database. // tryAuthenticate attempts to authenticate the request using the database.
//
// On success it returns the user and, as a side effect, sets the following
// values on the Echo context:
// - contextKeySource ("auth_source"): always set, one of UsageSourceWeb /
// UsageSourceAPIKey. UsageSourceLegacy is set elsewhere by the parent
// Middleware when a legacy env key matches.
// - contextKeyAPIKey ("auth_apikey"): set to the resolved *UserAPIKey for
// named-key branches (Bearer, x-api-key, xi-api-key, token cookie).
// - "_auth_session": session record, used by Middleware to drive cookie
// rotation. Only set on the session-cookie branch.
//
// contextKeyUser and contextKeyRole are populated by the parent Middleware
// after this function returns.
func tryAuthenticate(c echo.Context, db *gorm.DB, appConfig *config.ApplicationConfig) *User { func tryAuthenticate(c echo.Context, db *gorm.DB, appConfig *config.ApplicationConfig) *User {
hmacSecret := appConfig.Auth.APIKeyHMACSecret hmacSecret := appConfig.Auth.APIKeyHMACSecret
// a. Session cookie // a. Session cookie -> web UI
if cookie, err := c.Cookie(sessionCookie); err == nil && cookie.Value != "" { if cookie, err := c.Cookie(sessionCookie); err == nil && cookie.Value != "" {
if user, session := ValidateSession(db, cookie.Value, hmacSecret); user != nil { if user, session := ValidateSession(db, cookie.Value, hmacSecret); user != nil {
// Store session for rotation check in middleware // Store session for rotation check in middleware
c.Set("_auth_session", session) c.Set("_auth_session", session)
c.Set(contextKeySource, UsageSourceWeb)
return user return user
} }
} }
// b. Authorization: Bearer token // b. Authorization: Bearer
authHeader := c.Request().Header.Get("Authorization") authHeader := c.Request().Header.Get("Authorization")
if strings.HasPrefix(authHeader, "Bearer ") { if strings.HasPrefix(authHeader, "Bearer ") {
token := strings.TrimPrefix(authHeader, "Bearer ") token := strings.TrimPrefix(authHeader, "Bearer ")
// Try as session ID first // b1. Session token via Bearer -> still web UI
if user, _ := ValidateSession(db, token, hmacSecret); user != nil { if user, _ := ValidateSession(db, token, hmacSecret); user != nil {
c.Set(contextKeySource, UsageSourceWeb)
return user return user
} }
// Try as user API key // b2. Named API key
if key, err := ValidateAPIKey(db, token, hmacSecret); err == nil { if key, err := ValidateAPIKey(db, token, hmacSecret); err == nil {
c.Set(contextKeySource, UsageSourceAPIKey)
c.Set(contextKeyAPIKey, key)
return &key.User return &key.User
} }
} }
// c. x-api-key / xi-api-key headers // c. x-api-key / xi-api-key -> named API key
for _, header := range []string{"x-api-key", "xi-api-key"} { for _, header := range []string{"x-api-key", "xi-api-key"} {
if key := c.Request().Header.Get(header); key != "" { if k := c.Request().Header.Get(header); k != "" {
if apiKey, err := ValidateAPIKey(db, key, hmacSecret); err == nil { if apiKey, err := ValidateAPIKey(db, k, hmacSecret); err == nil {
c.Set(contextKeySource, UsageSourceAPIKey)
c.Set(contextKeyAPIKey, apiKey)
return &apiKey.User return &apiKey.User
} }
} }
} }
// d. token cookie (legacy) // d. token cookie -> named API key
if cookie, err := c.Cookie("token"); err == nil && cookie.Value != "" { if cookie, err := c.Cookie("token"); err == nil && cookie.Value != "" {
// Try as user API key
if key, err := ValidateAPIKey(db, cookie.Value, hmacSecret); err == nil { if key, err := ValidateAPIKey(db, cookie.Value, hmacSecret); err == nil {
c.Set(contextKeySource, UsageSourceAPIKey)
c.Set(contextKeyAPIKey, key)
return &key.User return &key.User
} }
} }

View File

@@ -303,4 +303,122 @@ var _ = Describe("Auth Middleware", func() {
} }
}) })
}) })
Describe("auth context plumbing for usage source", func() {
// probeApp builds a minimal echo app with the auth middleware and a single
// "/probe" route that captures the user, source, and apikey from context.
type probe struct {
user *auth.User
source string
key *auth.UserAPIKey
}
probeApp := func(db *gorm.DB, appConfig *config.ApplicationConfig, p *probe) *echo.Echo {
e := echo.New()
e.Use(auth.Middleware(db, appConfig))
e.GET("/probe", func(c echo.Context) error {
p.user = auth.GetUser(c)
p.source = auth.GetSource(c)
p.key = auth.GetAPIKey(c)
return c.NoContent(http.StatusOK)
})
return e
}
It("session cookie sets source=web, apikey=nil", func() {
db := testDB()
appConfig := config.NewApplicationConfig()
user := createTestUser(db, "alice@example.com", auth.RoleUser, auth.ProviderLocal)
token := createTestSession(db, user.ID)
var p probe
app := probeApp(db, appConfig, &p)
rec := doRequest(app, http.MethodGet, "/probe", withSessionCookie(token))
Expect(rec.Code).To(Equal(http.StatusOK))
Expect(p.user).ToNot(BeNil())
Expect(p.user.ID).To(Equal(user.ID))
Expect(p.source).To(Equal(auth.UsageSourceWeb))
Expect(p.key).To(BeNil())
})
It("Bearer session token sets source=web, apikey=nil", func() {
db := testDB()
appConfig := config.NewApplicationConfig()
user := createTestUser(db, "alice@example.com", auth.RoleUser, auth.ProviderLocal)
token := createTestSession(db, user.ID)
var p probe
app := probeApp(db, appConfig, &p)
rec := doRequest(app, http.MethodGet, "/probe", withBearerToken(token))
Expect(rec.Code).To(Equal(http.StatusOK))
Expect(p.user).ToNot(BeNil())
Expect(p.user.ID).To(Equal(user.ID))
Expect(p.source).To(Equal(auth.UsageSourceWeb))
Expect(p.key).To(BeNil())
})
It("Bearer API key sets source=apikey and exposes the resolved *UserAPIKey", func() {
db := testDB()
appConfig := config.NewApplicationConfig()
user := createTestUser(db, "alice@example.com", auth.RoleUser, auth.ProviderLocal)
plaintext, key, err := auth.CreateAPIKey(db, user.ID, "ci", auth.RoleUser, appConfig.Auth.APIKeyHMACSecret, nil)
Expect(err).ToNot(HaveOccurred())
var p probe
app := probeApp(db, appConfig, &p)
rec := doRequest(app, http.MethodGet, "/probe", withBearerToken(plaintext))
Expect(rec.Code).To(Equal(http.StatusOK))
Expect(p.source).To(Equal(auth.UsageSourceAPIKey))
Expect(p.key).ToNot(BeNil())
Expect(p.key.ID).To(Equal(key.ID))
})
It("x-api-key header sets source=apikey", func() {
db := testDB()
appConfig := config.NewApplicationConfig()
user := createTestUser(db, "alice@example.com", auth.RoleUser, auth.ProviderLocal)
plaintext, _, err := auth.CreateAPIKey(db, user.ID, "ci", auth.RoleUser, appConfig.Auth.APIKeyHMACSecret, nil)
Expect(err).ToNot(HaveOccurred())
var p probe
app := probeApp(db, appConfig, &p)
rec := doRequest(app, http.MethodGet, "/probe", withXApiKey(plaintext))
Expect(rec.Code).To(Equal(http.StatusOK))
Expect(p.source).To(Equal(auth.UsageSourceAPIKey))
Expect(p.key).ToNot(BeNil())
})
It("token cookie sets source=apikey", func() {
db := testDB()
appConfig := config.NewApplicationConfig()
user := createTestUser(db, "alice@example.com", auth.RoleUser, auth.ProviderLocal)
plaintext, _, err := auth.CreateAPIKey(db, user.ID, "ci", auth.RoleUser, appConfig.Auth.APIKeyHMACSecret, nil)
Expect(err).ToNot(HaveOccurred())
var p probe
app := probeApp(db, appConfig, &p)
rec := doRequest(app, http.MethodGet, "/probe", withTokenCookie(plaintext))
Expect(rec.Code).To(Equal(http.StatusOK))
Expect(p.source).To(Equal(auth.UsageSourceAPIKey))
Expect(p.key).ToNot(BeNil())
})
It("legacy env key sets source=legacy, apikey=nil", func() {
db := testDB()
appConfig := config.NewApplicationConfig()
appConfig.ApiKeys = []string{"legacy-secret"}
var p probe
app := probeApp(db, appConfig, &p)
rec := doRequest(app, http.MethodGet, "/probe", withBearerToken("legacy-secret"))
Expect(rec.Code).To(Equal(http.StatusOK))
Expect(p.source).To(Equal(auth.UsageSourceLegacy))
Expect(p.key).To(BeNil())
})
})
}) })

Some files were not shown because too many files have changed in this diff Show More