Compare commits

31 Commits

Author SHA1 Message Date
Ettore Di Giacinto
1a30020a82 ci(backend-signing): set COSIGN_EXPERIMENTAL=1 for oci-1-1 referrers mode
cosign v2.4.1 still gates --registry-referrers-mode=oci-1-1 behind the
experimental flag, so the first signing run after the backend-signing
merge failed with "you must set COSIGN_EXPERIMENTAL=1". Set it at the
job env level so both the quay and dockerhub cosign steps inherit it,
and note the requirement in .agents/backend-signing.md so a future
cosign bump can drop the flag.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
2026-05-24 08:21:05 +00:00
LocalAI [bot]
8bbe89a537 fix(distributed): route per request across loaded replicas + cache probeHealth (#9968)
* refactor(distributed): extract PickBestReplica from FindAndLockNodeWithModel

Lifts the replica-selection policy (in_flight ASC, last_used ASC,
available_vram DESC) out of the SQL ORDER BY into a pure Go function in
the new replicapicker.go. The SQL clause keeps its FOR UPDATE atomicity
and remains the production path used by SmartRouter; PickBestReplica is
the canonical implementation that the future per-frontend rotating
replica cache (TODO referenced from pkg/model) will call against an
in-memory snapshot without paying a DB round-trip per inference.

A new registry_test mirror spec seeds a multi-tier scenario and asserts
both layers pick the same replica, so any future tweak to either side
fails the test until the other side is updated.

No behavior change.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

* fix(distributed): route per inference request and cache probeHealth

Two related fixes that together restore load balancing across loaded
replicas of the same model.

1. ModelLoader.Load and LoadModel bypass the local *Model cache when
   modelRouter is set. The cached *Model wraps an InFlightTrackingClient
   bound to a single (nodeID, replicaIndex) — reusing it pinned every
   subsequent request to whichever node won the very first pick, so
   FindAndLockNodeWithModel's round-robin never got a chance to run
   even after the reconciler scaled the model out to a second node. In
   distributed mode SmartRouter.Route now runs per request, and
   PickBestReplica picks the least-loaded replica each time.

   SmartRouter has its own coalescing (advisory DB lock for first-time
   loads + singleflight on backend.install RPC) so concurrent first
   requests for a not-yet-loaded model still produce a single
   worker-side install.

2. SmartRouter.probeHealth memoizes successful gRPC HealthCheck results
   in a new probeCache (probe_cache.go) with a 30s TTL. With per-request
   routing every inference call hits probeHealth, and llama.cpp-style
   backends serialize HealthCheck behind active Predict — so a burst of
   incoming requests stalled on the probe to a node already mid-stream,
   tripping the 2s timeout and falling through to the install path.
   singleflight collapses N concurrent first-time probes for the same
   (node, addr) into one round-trip, failed probes invalidate the entry
   so the staleness-recovery path still triggers, and the TTL matches
   pkg/model/model.go's healthCheckTTL so the single-process and
   distributed paths share a staleness budget. The background
   HealthMonitor still reaps actually-dead backends within ~45s.

The bypass introduces one short FindAndLockNodeWithModel transaction per
inference. A TODO in pkg/model/loader.go documents the future per modelID
rotating-replica cache that would reuse PickBestReplica against an
in-memory snapshot and skip the DB round-trip for hot paths.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-24 08:15:27 +00:00
LocalAI [bot]
dcc5599f89 chore: ⬆️ Update leejet/stable-diffusion.cpp to a397e03488cc27e1a42da646b82dfce9f50741c0 (#9965)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-24 08:35:36 +02:00
LocalAI [bot]
a95f4e63e0 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 642c038ccdf3dd08e6d9ac6fdc3b1c311ebd8a02 (#9966)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-23 23:52:51 +02:00
LocalAI [bot]
dfd19a3f88 chore: ⬆️ Update ggml-org/llama.cpp to c0c7e147e7efa6c5858754b47259ba4880f8a906 (#9963)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-23 23:52:36 +02:00
LocalAI [bot]
d7387c725c feat(swagger): update swagger (#9962)
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-23 23:52:10 +02:00
LocalAI [bot]
63d84a5705 chore: ⬆️ Update antirez/ds4 to 444afce822057d87f14c4dec307dce24fd49b3ee (#9964)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-23 23:51:53 +02:00
LocalAI [bot]
1198d10b58 fix(traces): cap backend trace Data to keep admin UI responsive (#9960)
* fix(traces): cap backend trace Data field so the admin UI stays responsive

The previous fix (#9946) capped API trace bodies but missed backend traces,
which carry the same blast radius:

  - LLM backend traces store the full chat messages JSON, full response, and
    full streaming deltas. Every agent-pool reasoning step ships the full
    RAG-augmented history (50-500 KiB per trace, often 100+ traces queued).
  - TTS / audio_transform / transcript traces embed a 30s audio snippet as
    base64, around 1.3 MiB per trace.

Both blow the /api/backend-traces JSON past tens of MiB. The admin Traces
page then keeps re-downloading and re-parsing the buffer faster than the
5s auto-refresh and stays in the loading state forever, the same symptom
the API-side fix addressed.

Apply two complementary caps, both honoring LOCALAI_TRACING_MAX_BODY_BYTES:

Option A (safety net in core/trace): RecordBackendTrace walks the Data map
recursively and replaces any string value larger than the cap with
"<truncated: N bytes>". Catches anything a future producer forgets.

Option B (head-preserving at the producer):
  - core/backend/llm.go: TruncateToBytes on messages, response, and
    chat_deltas content/reasoning_content so the leading content stays
    readable in the UI.
  - core/trace/audio_snippet.go: omit audio_wav_base64 when the encoded
    blob would exceed the cap (truncated base64 is undecodable). The
    quality metrics still ship and the UI's WaveformPlayer simply skips
    when the field is absent.

TruncateToBytes is bounded to <= maxBytes so Option A leaves the producer's
head-preserving output alone instead of replacing it with the bare marker.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7

* fix(react-ui): expose tracing_max_body_bytes in Settings and Traces panels

The setting was already plumbed through env (LOCALAI_TRACING_MAX_BODY_BYTES),
CLI flag, and the runtime_settings.json GET/PUT schema, but neither the main
Settings page nor the inline Traces panel offered an input for it. Admins
hitting the "Traces UI stuck loading" symptom had to know to set an env var
or PUT raw JSON to /api/settings to dial the cap.

Add a "Max Body Bytes" row next to "Max Items" in both places. Same input
type, same disabled-when-tracing-off semantics, placeholder shows the 65536
default so users see what they're inheriting.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7

* test(react-ui): disambiguate Max Items locator after adding Max Body Bytes

The Tracing settings panel now has two number inputs. The previous spec
matched 'input[type="number"]' which became ambiguous and triggered a
Playwright strict-mode violation in CI. Switch to getByPlaceholder('100')
for Max Items and add a parallel spec for the new Max Body Bytes field
using getByPlaceholder('65536').

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-23 14:50:40 +02:00
LocalAI [bot]
a0f3e26245 fix(distributed): make admin backend installs resilient and observable (#9958)
* feat(distributed): add configurable NATS backend install/upgrade timeouts

Adds BackendInstallTimeout and BackendUpgradeTimeout to DistributedConfig
with 15m defaults, following the existing MCPToolTimeout / WorkerWaitTimeout
pattern. These will replace the hardcoded literals in RemoteUnloaderAdapter
so admin-driven backend installs across the cluster survive long OCI image
pulls that previously timed out at 3m.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* style(distributed): gofmt alignment after timeout fields

Re-aligns the Validate() negative-duration map and the Default* const
block so the new BackendInstall/UpgradeTimeout entries do not leave
the surrounding columns mis-padded.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(cli): surface LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT and _UPGRADE_TIMEOUT

Parses the two new env vars on the run CLI and threads them through the
existing AppOption builder so DistributedConfig picks them up. Invalid
duration strings now fail loudly at startup rather than silently falling
back to the default.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): inject NATS install/upgrade timeouts into RemoteUnloaderAdapter

Removes the hardcoded 3m / 15m literals from RemoteUnloaderAdapter and
threads in DistributedConfig.BackendInstallTimeoutOrDefault() and
BackendUpgradeTimeoutOrDefault() at construction. Install now defaults
to 15m (was 3m); cold OCI image pulls on Jetson Wi-Fi routinely blew
past the old ceiling. Scripted messaging client captures the timeout
so tests can assert the configured value actually reaches the NATS
request.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): introduce galleryop.ErrWorkerStillInstalling sentinel

When the NATS request-reply for backend.install (or .upgrade) times out
the worker is almost always still pulling the OCI image. Wrap the timeout
in a typed sentinel so the manager above can distinguish "worker hung"
from "worker still working" and leave the pending_backend_ops row in
place for the reconciler to confirm via backend.list.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): treat NATS install timeout as in-progress, not failure

When a worker times out replying to backend.install but the install is
still running on the worker, enqueueAndDrainBackendOp now reports a
running_on_worker status and pushes NextRetryAt out by the install
timeout so the reconciler does not immediately re-fire another install
while the worker is still pulling the image. The pending_backend_ops
row stays in place for the next reconciler pass to confirm via
backend.list.

InstallBackend wraps the result in galleryop.ErrWorkerStillInstalling
so callers can branch (galleryop renders yellow in-progress instead of
red error). UpgradeBackend uses the same wrap.

Adds RemoteUnloaderAdapter.InstallTimeout() so the manager can push
NextRetryAt by the configured timeout without reaching into a private
field, and NodeRegistry.RecordPendingBackendOpInFlight as the soft
cousin of RecordPendingBackendOpFailure.

Also includes incidental gofmt-driven struct-field alignment in
registry.go on lines unrelated to the change (touched files are
re-formatted to canonical form per project policy).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(distributed): don't increment Attempts on in-flight install timeout

An in-flight timeout (worker still pulling the OCI image) is not a
failed attempt, it's a delayed one. Incrementing Attempts let
genuinely-progressing slow installs (e.g. 30 GB CUDA images on Wi-Fi)
trip the reconciler's maxPendingBackendOpAttempts cap and dead-letter
the queue row while the worker was still legitimately working.

RecordPendingBackendOpInFlight now only updates LastError and NextRetryAt.
Also documents "running_on_worker" in the NodeOpStatus.Status enum
comment so Task 6 implementers see the full surface.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(galleryop): surface ErrWorkerStillInstalling as non-error OpStatus

When the distributed backend manager returns an error that wraps
ErrWorkerStillInstalling, backendHandler now completes the op with a
"still installing in background" message rather than marking it as a
red failure. Admin UI sees a yellow in-progress state; reconciler
confirms completion on its next pass.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(distributed): end-to-end install-timeout-then-reconcile

Wires Task 1-6 end-to-end so any seam mismatch surfaces in CI rather
than during a real cluster install. NATS times out, the queue row
stays alive with running_on_worker status, the worker eventually
reports the backend installed via backend.list, the manager surfaces
it via ListBackends.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): document LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT / _UPGRADE_TIMEOUT

Add the two new operator-tunable env vars to the Frontend Configuration
table in the distributed-mode docs. Explains the 15m default, when to
raise it (slow links pulling multi-GB OCI images), and the new
"still installing in background" admin-UI state when the round-trip
times out but the worker is still working.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): clear pending install rows when backend.list confirms

DistributedBackendManager.ListBackends now proactively clears
pending_backend_ops install rows whose (nodeID, backend) is reported
installed by backend.list. Operator UI updates immediately instead of
waiting up to installTimeout (default 15m) for the next reconciler
tick after NextRetryAt.

Only install rows are cleared; upgrade and delete intents are not
satisfied by presence in backend.list and continue to drain through
their normal reconciler paths.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(messaging): add BackendInstallProgressEvent wire type and subject

New NATS subject nodes.<nodeID>.backend.install.<opID>.progress lets the
worker publish transient progress events (file, current/total bytes,
percentage, phase) while a long-running install pulls its OCI image.
BackendInstallRequest gains an optional OpID field so the worker knows
which subject to publish on.

Transient pub/sub (not JetStream): the install reply remains ground
truth for success/failure; dropped progress events are tolerable.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* style(messaging): drop em-dash from BackendInstallProgress test comment

Per project convention (no em-dashes anywhere). Comment substance is
unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): worker publishes debounced install progress over NATS

When BackendInstallRequest.OpID is set, the worker's backend.install
handler wires a debounced publisher (250ms window) into the gallery
download callback. Each tick becomes a BackendInstallProgressEvent on
nodes.<nodeID>.backend.install.<opID>.progress; the publisher always
emits a final event on Flush so the UI sees the terminal percentage.

Old masters that do not set OpID continue to run silent installs: no
behavior change for them. Lock ordering: the publisher releases its
mutex before calling messaging.Publish so a slow network never stalls
the install loop.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): RemoteUnloaderAdapter subscribes to install progress

InstallBackend gains opID + onProgress parameters. When both are set,
the adapter subscribes to nodes.<nodeID>.backend.install.<opID>.progress
BEFORE publishing the install request, decodes each message into the
caller's onProgress callback in a goroutine (so a slow callback never
stalls the NATS reader thread), and unsubscribes after RequestJSON
returns.

When onProgress is nil OR opID is empty (the reconciler retry path),
subscription is skipped entirely - silent installs cost nothing extra.

Subscribe failure is logged at Warn and the install proceeds without
progress streaming; the NATS round-trip still owns terminal status.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): forward backend install progress into galleryop OpStatus

DistributedBackendManager.InstallBackend now passes the gallery op ID
and a progress bridge into the adapter call. Each
BackendInstallProgressEvent from the worker becomes a
galleryop.ProgressCallback tick - which the existing backendHandler
already turns into OpStatus.UpdateStatus, so the admin UI/SSE polling
sees per-byte progress for distributed installs without any UI-side
change.

UpgradeBackend is intentionally left silent for now: its wire request
(BackendUpgradeRequest) does not carry OpID, and rolling-update
fallback is the rarer path. Will be picked up in a follow-up if the
worker upgrade path also gets a progress channel.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(distributed): InstallBackend tolerates silent (pre-Phase-2) workers

A worker on pre-Phase-2 code never publishes progress events. The new
master subscribes optimistically; this spec pins that a silent worker
still produces a green install with no progressCb ticks. The install
reply is the source of truth for terminal state; the progress stream
is a best-effort UX enrichment.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): document install progress streaming

Note the new nodes.<nodeID>.backend.install.<opID>.progress subject and
the silent-worker compatibility behavior so operators know to expect
real-time progress and what happens on a mixed-version cluster.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): note progress-event ordering trade-off in InstallBackend

Document near the goroutine dispatch why ordering at the consumer is
best-effort, why it rarely matters in practice (worker debounce >>
goroutine jitter), and what a future hardening pass would look like
(Seq field + stale-by-seq drop). Stops the next reader from accidentally
"fixing" the goroutine pool away.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(galleryop): add NodeProgress + OpStatus.Nodes for per-node breakdown

Adds the data model the UI needs to render an expandable per-node
breakdown of a fanned-out backend install. NodeProgress carries node
identity (ID + name), per-node status (queued / running_on_worker /
success / error / downloading), the current file + bytes + percentage
from the Phase 2 progress stream, and any per-node error.

OpStatus.Nodes is the slice the /api/operations handler will surface
in a follow-up.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(galleryop): UpdateNodeProgress merges per-node ticks by NodeID

GalleryService.UpdateNodeProgress(opID, nodeID, np) merges a NodeProgress
into OpStatus.Nodes (keyed by NodeID, no duplicates) and mirrors the
latest tick into the aggregate Progress / FileName /
DownloadedFileSize / TotalFileSize fields so the legacy single-bar
OperationsBar view keeps working unchanged alongside the new per-node
breakdown.

Concurrent-safe via the existing g.Mutex.
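
The merge-by-NodeID behavior can be sketched as follows. The struct layouts and the service type are simplified assumptions; the real OpStatus carries more aggregate fields (FileName, DownloadedFileSize, TotalFileSize) than the single Progress value mirrored here.

```go
package main

import (
	"fmt"
	"sync"
)

// NodeProgress and OpStatus are reduced, hypothetical shapes.
type NodeProgress struct {
	NodeID     string
	Status     string
	Percentage float64
}

type OpStatus struct {
	Progress float64
	Nodes    []NodeProgress
}

type service struct {
	mu       sync.Mutex
	statuses map[string]*OpStatus
}

// UpdateNodeProgress replaces the entry with the same NodeID or appends
// a new one, then mirrors the latest tick into the aggregate Progress.
func (s *service) UpdateNodeProgress(opID string, np NodeProgress) {
	s.mu.Lock()
	defer s.mu.Unlock()
	st, ok := s.statuses[opID]
	if !ok {
		st = &OpStatus{}
		s.statuses[opID] = st
	}
	merged := false
	for i := range st.Nodes {
		if st.Nodes[i].NodeID == np.NodeID {
			st.Nodes[i] = np
			merged = true
			break
		}
	}
	if !merged {
		st.Nodes = append(st.Nodes, np)
	}
	st.Progress = np.Percentage
}

func main() {
	s := &service{statuses: map[string]*OpStatus{}}
	s.UpdateNodeProgress("op", NodeProgress{NodeID: "n1", Status: "downloading", Percentage: 10})
	s.UpdateNodeProgress("op", NodeProgress{NodeID: "n2", Status: "queued"})
	s.UpdateNodeProgress("op", NodeProgress{NodeID: "n1", Status: "downloading", Percentage: 40})
	fmt.Println(len(s.statuses["op"].Nodes), s.statuses["op"].Progress)
}
```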

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): write per-node OpStatus entries during install fan-out

DistributedBackendManager now accepts a nodeProgressSink and feeds it
two streams:

1. enqueueAndDrainBackendOp emits a per-node terminal entry on each
   status it appends to BackendOpResult (queued, success, error,
   running_on_worker). The opID is threaded through the function so
   the sink gets the right gallery op identity.

2. The install apply closure fans each BackendInstallProgressEvent
   into the sink as a downloading entry, alongside the legacy
   progressCb path so the aggregate single-bar view stays correct.

Production wiring passes the GalleryService (which implements
UpdateNodeProgress via Task 2) as the sink. Single-node tests pass
nil. DeleteBackend and UpgradeBackend pass an empty opID so the
sink path no-ops for ops that aren't gallery-tracked the same way
as Install.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(operations): expose per-node breakdown on /api/operations

When an operation's OpStatus has Nodes entries (populated by the
Phase 4 progress sink wiring), surface them as a "nodes" array on the
/api/operations response, sorted by node_name for stable rendering.

Backward compatible: legacy clients ignore the field; ops without any
node entries (single-node mode, model installs) omit the array entirely
thanks to the empty-slice guard.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): per-node breakdown in OperationsBar

When an install op fans out to more than one worker, the operations
bar now shows a "N nodes" chevron that expands into a per-node list.
Each row carries the node's status (color-coded pill), the current
file being downloaded, byte counts, percentage, and a thin per-node
progress bar. Yellow "Worker busy" pill marks running_on_worker
status with a tooltip explaining the NATS round-trip timed out but
the worker is still installing in the background.

Backward compatible: ops without a nodes field (legacy or single-node
mode) render as before. State for expand/collapse is local to the
component, keyed by jobID/id - reload starts collapsed.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): document per-node breakdown in the operations bar

Adds a short subsection covering the expandable "N nodes" chevron in
the OperationsBar admin UI, the meaning of each status pill, and
how it relates to the /api/operations nodes array.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(galleryop): UpdateStatus preserves Nodes when caller sends none

Real-world bug surfaced by the Phase 4 multi-worker smoke test: the
nodes[] array in /api/operations flickered between a single node at a
time on a 2-worker install. Root cause: the Phase 2 progress bridge
also calls the legacy progressCb -> UpdateStatus(&OpStatus{...}) on
every tick. UpdateStatus then overwrote the entire status pointer,
wiping the Nodes slice that UpdateNodeProgress had just merged in.

Fix: in UpdateStatus, if the incoming op has an empty Nodes slice,
carry forward the previous status's Nodes before storing. Callers
that explicitly populate Nodes still win (their slice replaces the
prior one, no merge across the two code paths).

Two regression specs added pinning both directions of the contract.
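
The contract can be sketched with a reduced status store. The types below are hypothetical simplifications of the galleryop ones; the sketch shows only the carry-forward rule and that an explicitly populated Nodes slice replaces the previous one.

```go
package main

import (
	"fmt"
	"sync"
)

// NodeProgress and OpStatus are reduced, hypothetical shapes.
type NodeProgress struct {
	NodeID string
	Status string
}

type OpStatus struct {
	Progress float64
	Nodes    []NodeProgress
}

type store struct {
	mu       sync.Mutex
	statuses map[string]*OpStatus
}

// UpdateStatus stores next, carrying the previous Nodes forward when
// the incoming status has none. A non-empty incoming slice wins.
func (s *store) UpdateStatus(opID string, next *OpStatus) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if prev, ok := s.statuses[opID]; ok && len(next.Nodes) == 0 {
		next.Nodes = prev.Nodes
	}
	s.statuses[opID] = next
}

func main() {
	s := &store{statuses: map[string]*OpStatus{}}
	s.UpdateStatus("op", &OpStatus{Nodes: []NodeProgress{{NodeID: "n1"}, {NodeID: "n2"}}})
	// A legacy progress tick that knows nothing about nodes.
	s.UpdateStatus("op", &OpStatus{Progress: 55})
	fmt.Println(len(s.statuses["op"].Nodes), s.statuses["op"].Progress)
}
```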

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): strip implementation details from user-facing docs

Trim the new install/upgrade timeout rows and the install-progress
sections to focus on what the operator sees and tunes. Drops:

- the NATS subject names and pub/sub mechanics
- "round-trip" / reconciler / backend.list jargon
- /api/operations polling cadence
- "pre-2026-05-22" version references

Reframes the breakdown text around the admin UI (Operations Bar,
chevron, status pills, "Worker busy" tooltip). Implementation context
lives in the agent notes and code comments.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(config): move DistributedConfig.Validate flag names to constants

The negative-duration check map was a wall of literal kebab-case
strings that had to stay in sync with the kong-derived CLI flag names
manually. Move them to a Flag* const block alongside the existing
Default* block so a rename of either the Go field or the CLI naming
convention forces a compile error rather than silent drift.

Sole consumer today is Validate; the constants are exported so future
operator-facing surfaces (e.g. error messages on other validation
paths) can reference them by name instead of repeating the literals.

Tests pin both the literal values (so a future "let's just rename
this" doesn't accidentally regress the CLI flag) and the negative-
duration error message for the new BackendInstall / BackendUpgrade
fields.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(distributed): extract NodeStatus and Phase enums to constants

Sweep for the same literal-string-as-identifier pattern called out on
the Validate flag names: the per-node install status enum
("queued" | "downloading" | "running_on_worker" | "success" | "error")
appeared as raw literals across managers_distributed.go (10+ sites,
including 3 separate `n.Status == "running_on_worker"` checks),
operation.go, and the test suite. Same shape for the Phase enum
("resolving" | "downloading" | "extracting" | "starting") in the
worker-side progress publisher.

Promote both to exported const blocks:

- galleryop.NodeStatus{Queued,Downloading,RunningOnWorker,Success,Error}
  shared between galleryop.NodeProgress.Status (the wire field) and
  nodes.NodeOpStatus.Status (the in-process per-node summary)
- messaging.Phase{Resolving,Downloading,Extracting,Starting}
  shared between the worker publisher and any future consumer that
  needs to switch on phase

Tests pin both the literal values (so a future "let's just rename" doesn't
silently change the JSON wire) and use the constants in setup (so the
producer side stays drift-protected). Wire-format assertions on the
/api/operations JSON output keep their literals deliberately, so the
constant value can never silently diverge from what the UI receives.
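
As an illustration of the pattern, a const block plus a check that pins the literal values might look like this. The literal strings come from the enum listed above; the identifier spellings and the isTerminal helper are assumptions made for this sketch.

```go
package main

import "fmt"

// Per-node install status values. Identifier names are hypothetical;
// the string values are the wire literals.
const (
	NodeStatusQueued          = "queued"
	NodeStatusDownloading     = "downloading"
	NodeStatusRunningOnWorker = "running_on_worker"
	NodeStatusSuccess         = "success"
	NodeStatusError           = "error"
)

// isTerminal is a hypothetical helper showing a call site that refers
// to the constants instead of repeating the literals.
func isTerminal(status string) bool {
	return status == NodeStatusSuccess || status == NodeStatusError
}

func main() {
	for _, s := range []string{
		NodeStatusQueued, NodeStatusDownloading,
		NodeStatusRunningOnWorker, NodeStatusSuccess, NodeStatusError,
	} {
		fmt.Printf("%-18s terminal=%v\n", s, isTerminal(s))
	}
}
```

A test that compares each constant to its literal, as the commit describes, makes a rename of the value fail loudly instead of silently changing the JSON the UI receives.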

Out of scope for this PR (separate cleanup): the finetune and
quantization job-status enums have the same anti-pattern with 14+
literal sites each, but predate this PR's work.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-23 12:35:44 +02:00
LocalAI [bot]
e4cc1f11f3 chore: ⬆️ Update ggml-org/llama.cpp to 1acee6bf8939948f9bcbf4b14034e4b475f06069 (#9952)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-23 08:38:29 +02:00
LocalAI [bot]
6ed269d0b9 chore: ⬆️ Update ggml-org/whisper.cpp to 0ccd896f5b882628e1c077f9769735ef4ce52860 (#9954)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-23 08:37:26 +02:00
LocalAI [bot]
5756fb046d chore: ⬆️ Update leejet/stable-diffusion.cpp to 0baf721215f45335a5df8caf0ecb34e870c956e7 (#9955)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-23 08:37:10 +02:00
Copilot
7980629bc5 Fix backend manifest merge signing on current cosign releases (#9957)
* Initial plan

* fix: remove deprecated cosign bundle flag from backend merge workflow

Agent-Logs-Url: https://github.com/mudler/LocalAI/sessions/4207dabc-14ec-4655-9594-487338977fcf

Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-23 00:20:28 +02:00
LocalAI [bot]
d0a59be9de chore: ⬆️ Update ikawrakow/ik_llama.cpp to b3d39cff8bffbd67296d6badd4076a1486a0715c (#9953)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-22 23:58:48 +02:00
LocalAI [bot]
5cda4f1ccf fix(L4T13 backends): switch vllm/sglang/vllm-omni to PyPI aarch64+cu130 wheels (#9950)
* fix(vllm): switch L4T13 backend to PyPI aarch64+cu130 wheels

The L4T13 vllm backend pulled torch / torchvision / torchaudio / vllm from
pypi.jetson-ai-lab.io's sbsa/cu130 mirror via [tool.uv.sources] with no
version pins. That mirror started shipping torch 2.11.0 next to a
vllm-0.20.0+cu130 wheel that was still compiled against torch 2.10's c10
ABI, so uv landed on the mismatched pair and vllm crashed at import:

  ImportError: vllm/_C.abi3.so: undefined symbol:
  _ZN3c1013MessageLoggerC1EPKciib

(c10::MessageLogger's constructor signature changed between torch 2.10 and
2.11; the vllm wheel referenced the 2.10 form, the installed libc10.so
exported only the 2.11 form.)

Since torch 2.11 (April 2026) PyPI publishes its own aarch64 + cu130
manylinux wheels, and vllm 0.20.0 ships an aarch64 wheel whose Requires-
Dist locks torch==2.11.0 / torchvision==0.26.0 / torchaudio==2.11.0. That
makes uv's resolver produce an ABI-consistent set automatically, so the
mirror and the [tool.uv.sources] pinning are no longer needed.

flash-attn is dropped from the dep list: PyPI has no aarch64 wheel, but
vLLM 0.20+ already bundles its own vllm_flash_attn (fa2 + fa3) inside the
main wheel, so the Dao-AILab package isn't required at runtime.

Reference: https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] [WebFetch]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(vllm): retire l4t13 pyproject.toml in favor of requirements-*.txt

pyproject.toml only existed because uv pip install -r requirements.txt
doesn't honor [tool.uv.sources]. The previous commit dropped [tool.uv.
sources] (PyPI now serves the aarch64 + cu130 wheels directly), so the
file no longer carries any logic the requirements-*.txt path can't.

Replace with the same two-file pattern every other build profile uses:

  - requirements-l4t13.txt       (accelerate / torch / transformers /
                                  bitsandbytes - matches cublas13's split)
  - requirements-l4t13-after.txt (vllm; runs after the base resolve so
                                  the cu130 torch wheel lands first)

install.sh's whole l4t13 elif branch goes away; libbackend.sh's
installRequirements already handles the requirements-install.txt build-
deps pass, the C_INCLUDE_PATH export for PORTABLE_PYTHON, and the
runProtogen call, so falling through to the standard else: branch
produces identical install behavior with less surface area.

No functional change at install time - same wheels, same order.

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(sglang,vllm-omni): switch L4T13 backends to PyPI aarch64+cu130 wheels

Same root cause and same fix as the vllm backend in the previous commits:
the L4T13 sglang and vllm-omni backends both pulled their accelerator
stack from pypi.jetson-ai-lab.io's sbsa/cu130 mirror with no version
pins, so they would silently land on the same torch 2.11 vs cu130-built
wheel ABI mismatch the moment the mirror published an out-of-sync pair.

sglang
------

- Drop pyproject.toml + [tool.uv.sources]. The historical comment said
  the [all] extra was unsafe on aarch64 because of decord, but sglang
  0.5.x now uses `decord2` on aarch64/arm/armv7l (which ships cp312
  aarch64 wheels), so we can match cublas13's sglang[all]>=0.5.11 pin
  and stop being capped at the 0.5.1.post2 the L4T mirror shipped.
  That unblocks Gemma 4 / MTP recipes on Jetson Thor.
- New requirements-l4t13.txt mirrors the cublas13 split (accelerate /
  torch / torchvision / torchaudio / transformers);
  requirements-l4t13-after.txt carries sglang[all]>=0.5.11.
- install.sh's l4t13 elif branch goes away; falls through to the
  standard installRequirements path.

vllm-omni
---------

- requirements-l4t13.txt drops --extra-index-url to jetson-ai-lab and
  drops flash-attn (PyPI has no aarch64 wheel, vLLM 0.20+ bundles its
  own vllm_flash_attn fa2 + fa3 internally).
- install.sh's l4t13 vllm-install branch collapses into the cublas13
  branch since both now just run `pip install vllm --torch-backend=auto`
  against PyPI.
- --index-strategy=unsafe-best-match is dropped from the top-level
  l4t13 guard; without the L4T mirror in the picture it had no purpose.

The from-source vllm-omni install on top still keeps its existing
`sed -i '/^fa3-fwd[[:space:]]*==/d' requirements/cuda.txt` workaround -
fa3-fwd has no aarch64 wheel and no sdist, unrelated to flash-attn.

Reference: https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] [WebFetch]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(sglang): drop [all] extra on l4t13 - xatlas has no aarch64 wheel

CI revealed that sglang[all]==0.5.12 transitively pulls xatlas via the
[diffusion] sub-extra, and xatlas ships no aarch64 wheel. Its sdist
depends on scikit_build_core without declaring it in
build-system.requires, so under --no-build-isolation uv can't build it
from source:

    × Failed to build `xatlas==0.0.11`
    ├─▶ The build backend returned an error
    ╰─▶ Call to `scikit_build_core.build.build_wheel` failed (exit status: 1)
        ModuleNotFoundError: No module named 'scikit_build_core'
    help: `xatlas` (v0.0.11) was included because `sglang[all]` (v0.5.12)
          depends on `xatlas`

Upstream sglang explicitly gates st_attn and vsa on
`platform_machine != aarch64` inside the same [diffusion] extra but
forgot xatlas - same class of bug that bit the old decord pin.

Use plain `sglang>=0.5.11` on l4t13. backend.py imports only base
sglang.srt symbols (Engine, ServerArgs, FunctionCallParser,
ReasoningParser); the [all] extras are optional accelerators not
required at import time. cublas13 (x86_64) keeps [all] because xatlas
has x86_64 wheels there.

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-22 23:01:22 +02:00
LocalAI [bot]
c500461c69 feat(config): default prompt_cache_all to true (#9951)
Upstream llama.cpp defaults `cache_prompt = true` (common/common.h),
but `parse_options` in the grpc-server backend unconditionally forwards
the proto `PromptCacheAll` field, so any model that didn't set
`prompt_cache_all: true` in its YAML was getting `cache_prompt=false` —
silently overriding llama.cpp's own default. With `kv_unified` and
`cache_idle_slots` already on by default, this was the last piece
preventing the per-request prompt cache from being usable out of the
box.

Make `PromptCacheAll` tristate (`*bool`), default it to `true` in
`SetDefaults`, and dereference at the proto boundary. Users can still
opt out with an explicit `prompt_cache_all: false`. Same pattern as
`MMap`, `MMlock`, `Reranking`, etc.
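
A minimal sketch of the tristate pattern described above, assuming a trimmed-down config struct. Only `PromptCacheAll` and `SetDefaults` are named in the commit message; `ModelConfig` and `promptCacheAllValue` are hypothetical stand-ins and not the LocalAI types:

```go
package main

import "fmt"

// ModelConfig is a hypothetical, trimmed-down stand-in for the real config type.
type ModelConfig struct {
	// nil means "not set in YAML"; a non-nil pointer is an explicit user choice.
	PromptCacheAll *bool `yaml:"prompt_cache_all"`
}

// SetDefaults fills in unset fields only, so an explicit false survives.
func (c *ModelConfig) SetDefaults() {
	if c.PromptCacheAll == nil {
		v := true
		c.PromptCacheAll = &v
	}
}

// promptCacheAllValue dereferences the pointer at the boundary where a plain
// bool is required (the proto field in the commit's description).
func promptCacheAllValue(c ModelConfig) bool {
	return c.PromptCacheAll != nil && *c.PromptCacheAll
}

func main() {
	unset := ModelConfig{}
	unset.SetDefaults()

	off := false
	optedOut := ModelConfig{PromptCacheAll: &off}
	optedOut.SetDefaults()

	fmt.Println(promptCacheAllValue(unset), promptCacheAllValue(optedOut))
}
```

The pointer is what distinguishes "never set" from "set to false"; a plain `bool` cannot carry that difference, which is why the default could not be applied safely before.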

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 22:06:22 +02:00
LocalAI [bot]
834ecc36bf fix(react-ui): unify backend-logs entry point for distributed mode (#9949)
In distributed mode the local /api/backend-logs WebSocket has nothing
behind it (inference runs on workers), so the "View backend logs" link
in Traces, and the equivalent action in Manage before it was hidden,
dead-ended on /app/backend-logs/<modelId>. Manage worked around it by
hiding the action; Traces still rendered the link.

Make /app/backend-logs/:modelId the single, mode-aware entry point.
A new BackendLogsRouter probes useDistributedMode and forks:

  - standalone: existing local WebSocket view (BackendLogsDetail).
  - distributed: DistributedBackendLogsResolver fans out to each node
    via nodesApi.getModels, filters by model_name, and routes:
      * 0 hits   -> empty state with a link to the Nodes page.
      * 1 hit    -> <Navigate replace> to
                    /app/node-backend-logs/<nodeId>/<modelId>,
                    preserving the ?from= deep-link timestamp.
      * N hits   -> picker listing each hosting worker (node id,
                    replica index, load state) so the operator can
                    choose which worker's logs to view.

Bare modelId in the redirect target intentionally aggregates that
node's replicas via the worker's BackendLogStore, matching the
existing per-node link pattern in Nodes.jsx.

Revert the per-caller distributed checks now that routing is
centralised: drop the hidden:distributedMode guard on Manage's
Backend logs action, and remove the prop threading in Traces so the
link is unconditional. Any future view that wants to link to backend
logs uses the same URL and gets correct behaviour in both modes.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-22 22:00:08 +02:00
LocalAI [bot]
61bf34ea2f fix(traces): cap captured body size to keep admin Traces UI responsive (#9946)
The trace middleware buffered the full request and response bodies for every
JSON exchange. With a chatty agent-pool RAG workload, /embeddings responses
(large vector arrays) accumulated to tens of MB in the in-memory buffer; the
admin Traces page would then download and parse 40+ MB on every load and on
every 5s auto-refresh, locking the UI in a loading state.

Add LOCALAI_TRACING_MAX_BODY_BYTES (default 64 KiB) that caps each captured
body. The full payload still flows through to the real client; only the
trace copy is bounded. Exchanges record body_truncated and original
body_bytes so the dashboard can show that truncation happened. The cap is
configurable via env, CLI, and runtime_settings.json.
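
The capping behaviour can be sketched roughly as follows. This is an illustration under assumptions, not the middleware's code: `capturedBody` and `captureBody` are hypothetical names, and only the recorded fields (`body_truncated`, original `body_bytes`) and the 64 KiB default come from the description above:

```go
package main

import "fmt"

// capturedBody is a hypothetical record for one side of a traced exchange.
type capturedBody struct {
	Body          []byte // bounded copy kept in the trace buffer
	BodyTruncated bool   // true when the copy is shorter than the payload
	BodyBytes     int    // original payload size before truncation
}

// captureBody copies at most maxBytes of payload into the trace record.
// The caller keeps using the full payload; only the trace copy is bounded.
// Treating a negative cap as zero is an assumption of this sketch.
func captureBody(payload []byte, maxBytes int) capturedBody {
	if maxBytes < 0 {
		maxBytes = 0
	}
	n := len(payload)
	if n > maxBytes {
		n = maxBytes
	}
	kept := make([]byte, n)
	copy(kept, payload[:n])
	return capturedBody{
		Body:          kept,
		BodyTruncated: n < len(payload),
		BodyBytes:     len(payload),
	}
}

func main() {
	const defaultCap = 64 * 1024 // 64 KiB, the default named in the commit
	big := make([]byte, 100*1024)
	c := captureBody(big, defaultCap)
	fmt.Println(len(c.Body), c.BodyTruncated, c.BodyBytes)
}
```

Copying into a fresh slice, rather than re-slicing the payload, lets the large original buffer be garbage-collected once the response has been sent.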

Also unblock recovery: the Traces page now keeps the Clear button enabled
while loading, since "buffer too large to render" is exactly when the user
needs to clear it.


Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-22 15:29:24 +02:00
LocalAI [bot]
0b2ae3c6ca fix(openai): stream usage non-zero when tools are enabled (#9941)
* chore: ignore local .worktrees directory

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(openai): stream usage non-zero when tools are enabled

The streaming chat-completions worker for tool-bearing requests
(processTools in core/http/endpoints/openai/chat.go) never forwarded the
cumulative TokenUsage from ComputeChoices to the chunks it placed on the
responses channel. The outer streaming loop's running usage tracker
therefore stayed at the zero value, and the include_usage trailer
reported {prompt_tokens:0, completion_tokens:0, total_tokens:0} whenever
the request carried a `tools` array. Without tools, the alternative
`process` path stamps Usage on every chunk, so that path was unaffected.

Forward the final TokenUsage via a usage-only sentinel chunk (empty
Choices, populated Usage) emitted right before close(responses). The
outer loop's per-chunk Usage capture moves above the empty-Choices skip
so the sentinel updates the tracker without ever reaching the wire,
keeping the existing OpenAI spec contract (intermediate chunks carry no
`usage` field, and the deferred-final-chunk helpers remain Usage-free
per the regression test for issue #8546).

Adds streamUsageFromTokenUsage, usageSentinelChunk, and
applyChunkToUsage helpers with focused Ginkgo coverage plus a flow-level
test that mirrors the outer-loop sequence.

Fixes #9927

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4-7 [Claude Code]

* refactor(openai): return final TokenUsage from stream workers

Replace the usage-only sentinel SSE chunk introduced in the previous
commit with a plain return value. The streaming workers process and
processTools (now extracted as package-level processStream and
processStreamWithTools) return (backend.TokenUsage, error); the outer
ChatEndpoint loop reads the cumulative counts off the existing `ended`
channel (now carrying streamWorkerResult{usage, err}) and builds the
include_usage trailer from a normal Go value after the LOOP exits.

This drops the empty-Choices "skip but capture Usage" rule from the
outer loop and removes the usageSentinelChunk / applyChunkToUsage
helpers entirely. The SSE responses channel is back to a single
purpose: wire chunks only.

processStream and processStreamWithTools move into chat_stream_workers.go
so they can be exercised directly from tests. The chat_stream_usage_test.go
suite now drives the workers with a mocked backend.ModelInferenceFunc
and asserts on the returned TokenUsage. The regression coverage for
issue #9927 is therefore behavioral: reverting the fix (discarding
ComputeChoices' usage return) makes the assertions fail with concrete
count mismatches.
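
The shape of this refactor can be sketched as below. It is a simplified illustration, not the code in chat_stream_workers.go: `TokenUsage` stands in for `backend.TokenUsage`, `runWorker` and `streamAll` are hypothetical, and the fixed prompt count is arbitrary. Only the `ended` channel and the `streamWorkerResult{usage, err}` shape are taken from the commit message:

```go
package main

import "fmt"

// TokenUsage is a stand-in for backend.TokenUsage.
type TokenUsage struct{ Prompt, Completion int }

// streamWorkerResult mirrors the shape named in the commit message.
type streamWorkerResult struct {
	usage TokenUsage
	err   error
}

// runWorker is a hypothetical worker: it puts wire chunks on responses and
// returns the cumulative usage as a plain value instead of a sentinel chunk.
func runWorker(tokens []string, responses chan<- string) (TokenUsage, error) {
	usage := TokenUsage{Prompt: 5} // arbitrary prompt size for the sketch
	for _, t := range tokens {
		responses <- t
		usage.Completion++
	}
	return usage, nil
}

// streamAll plays the role of the outer loop: it drains the wire chunks,
// then reads the final usage off the ended channel.
func streamAll(tokens []string) ([]string, TokenUsage, error) {
	responses := make(chan string)
	ended := make(chan streamWorkerResult, 1)
	go func() {
		u, err := runWorker(tokens, responses)
		close(responses)
		ended <- streamWorkerResult{usage: u, err: err}
	}()
	var wire []string
	for chunk := range responses {
		wire = append(wire, chunk) // responses carries wire chunks only
	}
	res := <-ended
	return wire, res.usage, res.err
}

func main() {
	wire, usage, err := streamAll([]string{"Hel", "lo", "!"})
	fmt.Println(len(wire), usage.Prompt, usage.Completion, err)
}
```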

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4-7 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-22 10:13:41 +02:00
LocalAI [bot]
4735345105 chore: ⬆️ Update ggml-org/llama.cpp to bb28c1fe246b72276ee1d00ce89306be7b865766 (#9934)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-22 09:49:33 +02:00
LocalAI [bot]
7384fd800b chore: ⬆️ Update antirez/ds4 to 8d576642c39b9a2d782a80159ba84ef5a81c0b81 (#9932)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-22 08:31:49 +02:00
LocalAI [bot]
6942713d85 chore: ⬆️ Update leejet/stable-diffusion.cpp to 3a8788cb7d74f185d6b18688e9563015524ecaf5 (#9933)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-22 00:31:19 +02:00
LocalAI [bot]
0cf52c44d4 chore: ⬆️ Update ggml-org/whisper.cpp to 8443cf05e3fa8ce1b32348e1bcbcf8fc31f7f3ae (#9929)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-21 23:24:01 +02:00
LocalAI [bot]
0d34cf7cbd chore: ⬆️ Update ikawrakow/ik_llama.cpp to 48a55f74e4c6e2aeda363dd386c1ac9170a0af71 (#9930)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-21 23:23:37 +02:00
LocalAI [bot]
f0cb02afb8 feat(usage): attribute Sources rows to user accounts in admin view (#9935)
The merged feature (#9920) let admins see per-API-key and per-source
totals but did not surface which user owned each key, and lumped
every user's Web UI traffic into a single global Web UI row. This
makes the admin Sources tab properly per-user attributable:

- KeyTotal gains UserID + UserName, populated from the snapshot the
  usage middleware already records. The by_key roll-up now groups by
  (api_key_id, api_key_name, user_id, user_name).
- New SourceTotals.ByUserSource roll-up groups (source, user_id,
  user_name) for sources without a key identity (web, legacy). Only
  populated on the admin path (includeLegacy=true); the non-admin
  endpoint stays unchanged for backwards compatibility.
- SourcesTable accepts showUserColumn={isAdmin}; admin view renders
  a User column, makes the search match user name/id, and expands
  Web UI / legacy pseudo-rows from the global aggregate to one row
  per user using by_user_source.

Refs: #9862

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-21 23:23:06 +02:00
LocalAI [bot]
a39e025d64 fix(nodes): make per-node backend install async via gallery job queue (#9928)
* feat(galleryop): add TargetNodeID to ManagementOp for single-node installs

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(galleryop): add NodeScopedKey helpers for per-node opcache rows

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(galleryop): use strings.Cut for NodeScopedKey parsing, reject empty nodeID

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(nodes): scope DistributedBackendManager.InstallBackend to single node via TargetNodeID

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(http): make /api/nodes/:id/backends/install async via gallery service job queue

The handler previously called unloader.InstallBackend synchronously and
blocked the browser for up to 3 minutes waiting on the NATS reply. It now
enqueues a TargetNodeID-scoped ManagementOp on BackendGalleryChannel and
returns HTTP 202 + jobID immediately, matching /api/backends/install/:id.

The opcache key is built via NodeScopedKey(nodeID, backend) so concurrent
installs of the same backend across different nodes do not stomp each
other. galleryService/opcache/appConfig are threaded through
RegisterNodeAdminRoutes for this.

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(http): log malformed backend_galleries override and stop test drain goroutine

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(api): expose nodeID for node-scoped backend ops in /api/operations

Node-scoped backend installs land in opcache under "node:<nodeID>:<backend>"
keys. Without splitting that prefix back out, the operations panel renders
the full key as the display name and has no structured way to label which
worker an install is targeting. Detect the prefix, surface nodeID as its own
response field, and reduce the display name back to the bare backend slug.
Bare (non-scoped) ops are left untouched so legacy installs do not gain a
misleading empty nodeID.
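
A rough sketch of building and splitting such keys, assuming the `node:<nodeID>:<backend>` layout quoted above and node IDs that contain no colon. `NodeScopedKey` is named in this commit series, but its real signature is not shown here; `ParseNodeScopedKey` is a hypothetical name for the parsing side:

```go
package main

import (
	"fmt"
	"strings"
)

const nodeKeyPrefix = "node:"

// NodeScopedKey builds the per-node opcache key for a backend install.
func NodeScopedKey(nodeID, backend string) string {
	return nodeKeyPrefix + nodeID + ":" + backend
}

// ParseNodeScopedKey splits a key back into nodeID and backend.
// Bare (non-scoped) keys and keys with an empty nodeID are returned
// unchanged with ok=false, so legacy ops never gain a nodeID.
func ParseNodeScopedKey(key string) (nodeID, backend string, ok bool) {
	rest, found := strings.CutPrefix(key, nodeKeyPrefix)
	if !found {
		return "", key, false
	}
	nodeID, backend, found = strings.Cut(rest, ":")
	if !found || nodeID == "" {
		return "", key, false
	}
	return nodeID, backend, true
}

func main() {
	key := NodeScopedKey("worker-1", "llama-cpp")
	id, backend, ok := ParseNodeScopedKey(key)
	fmt.Println(key, id, backend, ok)
}
```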

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(react-ui): poll job status for node-targeted backend installs

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(react-ui): make NodeInstallPicker state updates pure and surface cancellations as errors

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(react-ui): clarify async semantics in handleInstallOnTarget

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(http): use statusUrl casing for node install response to match codebase precedent

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-21 22:25:53 +02:00
Ettore Di Giacinto
05e8e1e9f4 ci(images): publish chronologically-orderable master-<epoch>-<sha> tags
The existing master push pipeline produces `master` (rolling) and
`sha-<short>` tags. Neither is orderable by build time, so downstream
GitOps that want to auto-bump to the newest master build (e.g. Flux
ImagePolicy) can't pick the latest from the tag list — alphabetical
sort over hex shas is effectively random, and the rolling `master`
tag can't be referenced as an immutable bump target.

Add a third tag of the form `master-<epoch>-<sha>` (Unix epoch in
seconds + short sha), gated on default-branch pushes via
metadata-action's `is_default_branch` predicate. The sha is retained for
traceability; the epoch makes the tags numerically orderable, so a
Flux ImagePolicy like

  filterTags:
    pattern: '^master-(?P<ts>[0-9]+)-[a-f0-9]+$'
    extract: '$ts'
  policy:
    numerical:
      order: asc

will reliably bump to the newest master build.
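
For consumers that do not use Flux, the same selection can be sketched in Go. This is an illustration only: `newestMasterTag` is a hypothetical helper, the epoch values in `main` are made up, and the regular expression is the one from the policy above:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Same pattern as the ImagePolicy filter; Go accepts the (?P<name>...) syntax.
var masterTag = regexp.MustCompile(`^master-(?P<ts>[0-9]+)-[a-f0-9]+$`)

// newestMasterTag returns the tag with the highest epoch, ignoring tags
// that do not match the master-<epoch>-<sha> form.
func newestMasterTag(tags []string) (string, bool) {
	best, bestTS, found := "", int64(-1), false
	for _, t := range tags {
		m := masterTag.FindStringSubmatch(t)
		if m == nil {
			continue
		}
		ts, err := strconv.ParseInt(m[masterTag.SubexpIndex("ts")], 10, 64)
		if err != nil {
			continue
		}
		if ts > bestTS {
			best, bestTS, found = t, ts, true
		}
	}
	return best, found
}

func main() {
	tags := []string{
		"master",
		"sha-05e8e1e",
		"master-1779300000-f15b917", // hypothetical epochs
		"master-1779383910-05e8e1e",
	}
	fmt.Println(newestMasterTag(tags))
}
```

Comparing the extracted epoch as an integer, rather than sorting the tag strings, is what makes the ordering independent of the hex sha suffix.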

Applied to both image_build.yml (OCI labels stay consistent) and
image_merge.yml (the actual tag publisher via buildx imagetools).
2026-05-21 17:18:30 +00:00
Rin
a7f6cc8956 [utils] Fail immediately on extraction errors (#9926)
utils: fail immediately on extraction errors

Setting ContinueOnError to false ensures that ExtractArchive does not
leave the model or backend directory in an inconsistent state if a
partial failure occurs. This improves robustness against malformed
archives or unexpected I/O issues during installation.

Signed-off-by: RinZ27 <222222878+RinZ27@users.noreply.github.com>
2026-05-21 19:00:33 +02:00
LocalAI [bot]
f15b9178ec feat(usage): track and visualise usage per API key (#9920)
* feat(usage): add Source, APIKeyID, APIKeyName columns to UsageRecord

Adds three additive columns plus UsageSource* constants. The columns
are auto-migrated by InitDB. APIKeyID is a nullable foreign reference
to UserAPIKey.ID; APIKeyName is snapshotted on each row so revoked
keys keep showing their name in history.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(usage): backfill Source on pre-feature usage rows

InitDB now classifies any pre-existing usage_record with an empty
source: 'legacy-api-key' user -> legacy, everything else -> web.
The backfill is idempotent (only touches NULL/empty rows).

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(usage): add GetUserUsageBySource aggregator

Groups by (bucket, source, api_key_id, api_key_name). Filters out
legacy by default. Returns both per-bucket detail and roll-ups
(by_source, by_key sorted desc and capped at 200, grand_total).

The MAX(created_at) projection is iterated via Rows().Scan into a
string column and parsed manually because the SQLite driver surfaces
the aggregated timestamp as a string, which database/sql refuses to
scan directly into time.Time. Postgres returns a real timestamp; the
same string path handles its RFC3339 form too.
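
A minimal sketch of parsing such a string column, assuming three candidate layouts. The layouts and the name `parseLastUsed` are illustrative guesses, not the list used by the actual parser:

```go
package main

import (
	"fmt"
	"time"
)

// Candidate layouts, tried in order. These are assumptions for the sketch.
var lastUsedLayouts = []string{
	time.RFC3339Nano,                      // e.g. 2026-05-21T14:34:02Z
	"2006-01-02 15:04:05.999999999-07:00", // space-separated with zone offset
	"2006-01-02 15:04:05",                 // space-separated, no zone (read as UTC)
}

// parseLastUsed converts the string form of an aggregated timestamp
// into a time.Time, reporting false when no layout matches.
func parseLastUsed(s string) (time.Time, bool) {
	for _, layout := range lastUsedLayouts {
		if t, err := time.Parse(layout, s); err == nil {
			return t, true
		}
	}
	return time.Time{}, false
}

func main() {
	for _, s := range []string{
		"2026-05-21T14:34:02Z",
		"2026-05-21 14:34:02.5+02:00",
		"not a timestamp",
	} {
		t, ok := parseLastUsed(s)
		fmt.Println(ok, t.UTC().Format(time.RFC3339Nano))
	}
}
```

Returning an explicit `ok` flag lets the caller log format misses instead of silently storing a zero time.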

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(usage): log Rows() errors and assert LastUsed in tests

Adds rows.Err() and Rows() open-failure logging in
computeSourceTotals so silent data drops surface in logs. Logs on
parseLastUsedString format misses for the same reason. Strengthens
the snapshot-survival test to assert LastUsed is a recent timestamp,
locking the SQLite time-string parser behaviour.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(usage): add admin GetAllUsageBySource with filters and truncation

Optional user_id and api_key_id filters (composed with AND). Legacy
bucket is included for admin callers. truncated=true when more than
200 distinct keys would be in the by_key roll-up.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(auth): plumb auth_source and auth_apikey through Echo context

tryAuthenticate now sets auth_source on every successful branch
(web for session/Bearer-session, apikey for Bearer-key/x-api-key/
token-cookie, legacy for legacy env key match). For named-key
branches it also stores the resolved *UserAPIKey under auth_apikey
so downstream middlewares can snapshot id+name without re-validating.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(auth): expand tryAuthenticate godoc and cover Bearer-session branch

Documents all three context-key side effects (auth_source,
auth_apikey, _auth_session) plus the split of responsibilities with
the parent Middleware. Adds a test for the Bearer-as-session-token
classification so future regressions there fail loudly.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(usage): UsageMiddleware records source + snapshots key name

Reads auth_source and auth_apikey from the Echo context (set by
auth.Middleware in the previous task). Snapshots UserAPIKey.ID and
Name onto each row so revoked keys remain readable in history.
Falls back to source=web when no auth_source is set (auth disabled
or unrecognised path).

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(usage): add /api/auth/usage/sources and admin variant

Self endpoint filters legacy server-side; admin endpoint includes
legacy and accepts user_id + api_key_id filters. Response includes
buckets, totals.{by_source, by_key, grand_total}, and a truncated
flag set when the per-key roll-up was capped at 200.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(routes): mark test mirror handlers as keep-in-sync with production

The newTestAuthApp helper duplicates production route handlers
inline because it cannot use RegisterAuthRoutes (which requires a
*application.Application). Naming the source path on each mirror
makes the drift contract explicit for future maintainers.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): add usageApi.getMySources/getAdminSources + i18n strings

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): add Sources tab skeleton with data fetch

Adds Usage page tab that fetches /api/auth/usage/sources (or the
admin variant). Renders raw totals plus a placeholder key list;
real visualisations land in subsequent commits. Restructures the
existing tab button block so Models and Sources are visible to
non-admins (Users remains admin-only).

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): source mix ribbon + searchable/sortable sources table

Replaces the SourcesTab placeholder rendering with two reusable
components: SourceMixRibbon (one segmented bar per source class)
and SourcesTable (search + sort + revoked-key dim). Pulls the
current API key list to detect revoked keys.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ui): skip revoked-key detection until the key list is known

existingKeyIds defaulted to an empty Set, which made every live
api_key row render as (revoked) during the brief window before
apiKeysApi.list() resolved, and permanently after a fetch failure.
Use null as the unknown state and suppress the revoked badge until
the parent provides a real Set.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): top-N stacked time chart and drill-in chip for Sources tab

Top 7 sources by total tokens get distinct colours; the rest roll up
into 'Other'. Clicking a row in the SourcesTable dims everything
except that series in the chart; the chip is the canonical way to
clear the selection.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(usage): document per-API-key Sources tab and endpoints

Extends features/authentication.md Usage Tracking section with:
- A 'Sources' tab description and source-class taxonomy
- Endpoint documentation for /api/auth/usage/sources and the
  admin variant
- Response shape example with by_source / by_key / grand_total
- Migration note about pre-feature row backfill

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(usage): silence errcheck on deferred rows.Close

CI errcheck flagged the bare 'defer rows.Close()' in
computeSourceTotals. Wrap in a closure that discards the close
error explicitly; an error here is non-actionable since we have
already drained the rows and logged any iteration failure.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(usage): bound batcher intake and add Shutdown/FlushNow hooks

The pre-existing usage batcher had no cap on its add() path; the
usageMaxPending=5000 constant only guarded the re-queue path after
a failed write, leaving memory growth unbounded if the DB fell
behind. This commit:

- Adds the cap to add() so saturation drops new records (rate-limited
  warn at 1/1024) instead of growing unbounded.
- Raises usageMaxPending to 50000 to absorb realistic inference bursts.
- Replaces the package-level batcher global with a mutex-guarded pair
  plus a currentBatcher() accessor so Init / Shutdown cycles are
  race-free.
- Adds ShutdownUsageRecorder() for graceful drain on process exit
  (not yet wired into app shutdown, just published).
- Adds FlushNow() for deterministic tests; the middleware suite no
  longer needs 6s sleeps per spec and now runs in ~50ms instead of 18s.
- Re-queue on failed flush is now cap-aware: prepends as much of the
  failed batch as fits alongside concurrent arrivals, instead of
  dropping the whole batch when full.
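
The intake cap and the cap-aware re-queue can be sketched like this. It is a simplified model, not the LocalAI batcher: the type and method names are hypothetical, and keeping the oldest records of a failed batch when it does not fully fit is an assumption of the sketch:

```go
package main

import (
	"fmt"
	"sync"
)

type record struct{ ID int }

// batcher is a hypothetical bounded in-memory queue of usage records.
type batcher struct {
	mu         sync.Mutex
	pending    []record
	maxPending int
	dropped    int
}

// add enqueues r unless the buffer is full, in which case r is dropped.
func (b *batcher) add(r record) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.pending) >= b.maxPending {
		b.dropped++
		return false
	}
	b.pending = append(b.pending, r)
	return true
}

// take removes and returns everything currently pending.
func (b *batcher) take() []record {
	b.mu.Lock()
	defer b.mu.Unlock()
	batch := b.pending
	b.pending = nil
	return batch
}

// requeue puts a failed batch back in front of newer arrivals, keeping only
// as many of its records as still fit under the cap.
func (b *batcher) requeue(failed []record) {
	b.mu.Lock()
	defer b.mu.Unlock()
	room := b.maxPending - len(b.pending)
	if room <= 0 {
		b.dropped += len(failed)
		return
	}
	if len(failed) > room {
		b.dropped += len(failed) - room
		failed = failed[:room]
	}
	b.pending = append(append([]record{}, failed...), b.pending...)
}

func main() {
	b := &batcher{maxPending: 3}
	for i := 1; i <= 4; i++ {
		b.add(record{ID: i}) // the fourth add hits the cap and is dropped
	}
	failed := b.take() // pretend the DB write for this batch failed
	b.add(record{ID: 5})
	b.requeue(failed)
	fmt.Println(len(b.pending), b.dropped)
}
```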

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(usage): drain usage batcher on graceful shutdown

Registers ShutdownUsageRecorder with the existing
signals.RegisterGracefulTerminationHandler so SIGINT/SIGTERM
synchronously flushes any in-memory usage records before the
process exits. Without this, up to one flush interval (5s) of
recorded usage was lost when LocalAI restarted.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-21 16:34:02 +02:00
LocalAI [bot]
959de86761 feat(llama-cpp): make server-side prompt cache work by default (#9925)
Aligns LocalAI's llama-cpp gRPC backend with upstream's auto-on prompt
cache path so repeated system prompts (agents, OpenAI/Anthropic-compatible
CLIs, coding assistants) skip prefill on subsequent calls without any
YAML changes. Reported in #9921.

Upstream's server enables `kv_unified=true` (and bumps `n_parallel` to 4)
when slot count is auto, which unlocks `cache_idle_slots`. LocalAI
hardcodes `n_parallel=1` and so far also hardcoded `kv_unified=false`,
which silently force-disables idle-slot saving at server init. The host
prompt cache was allocated but never written across requests.

Changes in backend/cpp/llama-cpp/grpc-server.cpp:
- params.kv_unified: false -> true (single-slot path now benefits from
  the prompt cache; users can opt out with `kv_unified:false`)
- params.n_ctx_checkpoints: 8 -> 32 (match upstream default)
- params.cache_idle_slots = true initialized explicitly (upstream default)
- params.checkpoint_every_nt = 8192 initialized explicitly (upstream default)
- New option parsers: cache_idle_slots / idle_slots_cache,
  checkpoint_every_nt / checkpoint_every_n_tokens

Docs:
- features/text-generation.md: fix misleading `cache_ram` description
  (it's the host-side prompt cache, not the KV cache), document the
  kv_unified + cache_ram + cache_idle_slots interaction, add rows for
  the two newly-exposed options, and add a worked example for the
  agent/CLI workload from the issue.
- advanced/model-configuration.md: mark the legacy `prompt_cache_path`
  / `prompt_cache_all` / `prompt_cache_ro` YAML fields as unused by the
  llama-cpp gRPC backend (they target upstream's CLI completion tool
  and are not consumed by grpc-server.cpp) and point readers at the
  new prompt-cache explainer.

Closes #9921

Assisted-by: claude:opus-4.7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-21 16:31:48 +02:00
LocalAI [bot]
4c234abc2c refactor(agents): bump skillserver, drop redundant Name from list_skills output (#9916)
refactor(agents): bump skillserver, drop redundant Name from list_skills/search_skills

skillserver's list_skills MCP tool used to ship every entry with name=""
(field was commented out), while search_skills populated it - two tools
with inconsistent shape for the same data. skill.Name and skill.ID are
populated from the same source string anyway (the directory name), so
returning both was pure duplication.

Bumps github.com/mudler/skillserver to a7317cb, which drops the Name
field from both SkillInfo and SearchResult and leaves ID as the single
canonical identifier (already what read_skill consumes).

Adds core/services/skills/skills_mcp_test.go, a regression that drives
the LocalAI FilesystemManager through an in-process MCP session and
asserts a newly-created skill is visible by ID on the still-open session.

This is a cleanup, not the root cause of #9868 - the reporter likely
sees something deeper than a cosmetic JSON shape issue.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-21 14:45:53 +02:00
134 changed files with 7063 additions and 997 deletions

@@ -16,7 +16,8 @@ side (`pkg/oci/cosignverify` plus the gallery YAML).
per-arch manifest before checking signatures.
- **Storage:** Signatures are written as OCI 1.1 referrers
(`--registry-referrers-mode=oci-1-1`) in the new Sigstore bundle format
(`--new-bundle-format`). No `:sha256-<hex>.sig` tag clutter.
(current cosign releases do this by default; no `--new-bundle-format`
flag). No `:sha256-<hex>.sig` tag clutter.
- **Consumer:** `pkg/oci/cosignverify` discovers the bundle via the
referrers API, hands it to `sigstore-go`, and verifies it against the
policy declared in the gallery YAML (`Gallery.Verification`).
@@ -33,15 +34,14 @@ to sign. The job needs:
- `permissions: { id-token: write, contents: read }` at the job level so
the runner can exchange its GitHub OIDC token for a Fulcio cert.
- `sigstore/cosign-installer@v3` step (cosign ≥ 2.2 for
`--new-bundle-format`).
- `sigstore/cosign-installer@v3` step (current cosign releases already
default to the new bundle format).
- After each `docker buildx imagetools create`, resolve the resulting
list digest with `docker buildx imagetools inspect <tag> --format
'{{.Manifest.Digest}}'` and sign:
```sh
cosign sign --yes --recursive \
--new-bundle-format \
--registry-referrers-mode=oci-1-1 \
"${REGISTRY_REPO}@${DIGEST}"
```
@@ -49,6 +49,12 @@ cosign sign --yes --recursive \
Sign by digest, never by tag — signing by tag binds the signature to
whatever the tag points at *now*, and a subsequent tag push orphans it.
`--registry-referrers-mode=oci-1-1` is still gated behind
`COSIGN_EXPERIMENTAL=1` in cosign v2.4.x (set at the job env level in
`backend_merge.yml`). Re-evaluate when bumping the pinned cosign release
— newer versions are expected to graduate this flag and the env var can
then be dropped.
`backend_build_darwin.yml` builds and pushes single-arch darwin images
that bypass the manifest-list merge. If/when those entries get a gallery
`verification:` policy, the equivalent cosign step has to land there

@@ -40,6 +40,11 @@ jobs:
id-token: write
env:
quay_username: ${{ secrets.quayUsername }}
# cosign v2.4.x still gates --registry-referrers-mode=oci-1-1 behind
# this flag. Without it, signing fails with:
# invalid argument "oci-1-1" for "--registry-referrers-mode" flag:
# in order to use mode "oci-1-1", you must set COSIGN_EXPERIMENTAL=1
COSIGN_EXPERIMENTAL: '1'
steps:
# Sparse checkout: the merge job needs `.github/scripts/` (for the
# keepalive cleanup script) but none of the source tree.
@@ -66,7 +71,8 @@ jobs:
# cosign signs each pushed manifest list with --recursive so the
# index and every per-arch entry get an attached Sigstore bundle.
# 2.2+ is required for --new-bundle-format.
# Recent cosign releases always emit the new bundle format, so
# there's no extra CLI flag to opt into it.
- name: Install cosign
if: github.event_name != 'pull_request'
uses: sigstore/cosign-installer@v3
@@ -153,7 +159,6 @@ jobs:
# manifest before checking signatures need the per-arch
# signatures, not just the list-level one.
cosign sign --yes --recursive \
--new-bundle-format \
--registry-referrers-mode=oci-1-1 \
"quay.io/go-skynet/local-ai-backends@${digest}"
@@ -180,7 +185,6 @@ jobs:
' <<< "$DOCKER_METADATA_OUTPUT_JSON")
digest=$(docker buildx imagetools inspect "$first_tag" --format '{{.Manifest.Digest}}')
cosign sign --yes --recursive \
--new-bundle-format \
--registry-referrers-mode=oci-1-1 \
"localai/localai-backends@${digest}"

@@ -106,6 +106,7 @@ jobs:
type=ref,event=branch
type=semver,pattern={{raw}}
type=sha
type=raw,value={{branch}}-{{date 'X'}}-{{sha}},enable={{is_default_branch}}
flavor: |
latest=${{ inputs.tag-latest }}
suffix=${{ inputs.tag-suffix }},onlatest=true

@@ -80,6 +80,7 @@ jobs:
type=ref,event=branch
type=semver,pattern={{raw}}
type=sha
type=raw,value={{branch}}-{{date 'X'}}-{{sha}},enable={{is_default_branch}}
flavor: |
latest=${{ inputs.tag-latest }}
suffix=${{ inputs.tag-suffix }},onlatest=true

.gitignore

@@ -77,3 +77,6 @@ local-backends/
tests/e2e-ui/ui-test-server
core/http/react-ui/playwright-report/
core/http/react-ui/test-results/
# Local worktrees
.worktrees/

@@ -1,10 +1,10 @@
# ds4 backend Makefile.
#
# Upstream pin lives below as DS4_VERSION?=2606543be7a8c125a32cee37f5d1d85dc78f2fcf
# Upstream pin lives below as DS4_VERSION?=444afce822057d87f14c4dec307dce24fd49b3ee
# (.github/bump_deps.sh) can find and update it - matches the
# llama-cpp / ik-llama-cpp / turboquant convention.
DS4_VERSION?=2606543be7a8c125a32cee37f5d1d85dc78f2fcf
DS4_VERSION?=444afce822057d87f14c4dec307dce24fd49b3ee
DS4_REPO?=https://github.com/antirez/ds4
CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))

@@ -1,5 +1,5 @@
IK_LLAMA_VERSION?=11a1fea9e291f12ce2c803a9d7812c30ca806bcf
IK_LLAMA_VERSION?=642c038ccdf3dd08e6d9ac6fdc3b1c311ebd8a02
LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp
CMAKE_ARGS?=

@@ -1,5 +1,5 @@
LLAMA_VERSION?=ad277572619fcfb6ddd38f4c6437283a4b2b8636
LLAMA_VERSION?=c0c7e147e7efa6c5858754b47259ba4880f8a906
LLAMA_REPO?=https://github.com/ggerganov/llama.cpp
CMAKE_ARGS?=

@@ -517,10 +517,27 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
params.warmup = true;
// no_op_offload: disable host tensor op offload (default: false)
params.no_op_offload = false;
// kv_unified: enable unified KV cache (default: false)
params.kv_unified = false;
// n_ctx_checkpoints: max context checkpoints per slot (default: 8)
params.n_ctx_checkpoints = 8;
// kv_unified: enable unified KV cache. Upstream's server auto-enables this
// when the slot count is auto (-np <0), bumping n_parallel to 4 alongside.
// LocalAI keeps n_parallel=1 by default, which would skip that auto path
// and leave kv_unified=false. We flip the default to true here so the
// server-side prompt cache (cache_idle_slots) is actually usable on the
// single-slot path that LocalAI ships with: without it, idle slots are
// never persisted across requests and the prompt cache is dead weight.
// Users can opt out with `options: [ "kv_unified:false" ]`.
params.kv_unified = true;
// n_ctx_checkpoints: max context checkpoints per slot. Match upstream's
// default (32); the previous LocalAI-specific 8 was unnecessarily tight
// and limits partial-prefix recovery without a clear memory rationale.
params.n_ctx_checkpoints = 32;
// cache_idle_slots: save and clear idle slot KV to the prompt cache on
// task switch. Upstream default is true; the server auto-disables it if
// kv_unified=false or cache_ram_mib=0, so flipping kv_unified above is
// what actually unlocks it.
params.cache_idle_slots = true;
// checkpoint_every_nt: create a context checkpoint every N tokens during
// prefill (-1 disables). Match upstream's default (8192).
params.checkpoint_every_nt = 8192;
// decode options. Options are in the form optname:optvalue, or just optname for booleans.
for (int i = 0; i < request->options_size(); i++) {
@@ -679,7 +696,29 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
try {
params.n_ctx_checkpoints = std::stoi(optval_str);
} catch (const std::exception& e) {
// If conversion fails, keep default value (8)
// If conversion fails, keep default value (32)
}
}
// --- server-side idle-slot prompt cache toggle (upstream --cache-idle-slots) ---
// Saves the slot's KV state into the host-side prompt cache on task
// switch so a later request with the same prefix can warm-load it.
// Auto-disabled by the server if kv_unified=false or cache_ram=0.
} else if (!strcmp(optname, "cache_idle_slots") || !strcmp(optname, "idle_slots_cache")) {
if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
params.cache_idle_slots = true;
} else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
params.cache_idle_slots = false;
}
// --- prefill checkpoint cadence (upstream -cpent / --checkpoint-every-n-tokens) ---
// -1 disables checkpointing during prefill.
} else if (!strcmp(optname, "checkpoint_every_nt") || !strcmp(optname, "checkpoint_every_n_tokens")) {
if (optval != NULL) {
try {
params.checkpoint_every_nt = std::stoi(optval_str);
} catch (const std::exception& e) {
// If conversion fails, keep default value (8192)
}
}
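The boolean handling for `cache_idle_slots` accepts several spellings and leaves the value untouched for anything else. A minimal sketch of that behavior in shell, for illustration only: `apply_bool_option` is a hypothetical name, and treating a bare option name as `true` is an assumption based on the option-format comment in the same function.

```sh
# Hypothetical helper: print the new value of a boolean option given its
# current value ($1) and one "optname:optval" entry ($2).
apply_bool_option() {
    current=$1
    entry=$2
    case "$entry" in
        *:*) val=${entry#*:} ;;
        # Assumption: a bare option name counts as "true".
        *)   val=true ;;
    esac
    case "$val" in
        true|1|yes|on|enabled)   echo true ;;
        false|0|no|off|disabled) echo false ;;
        # Unrecognized spellings keep the current value.
        *)                       echo "$current" ;;
    esac
}
```

For example, `apply_bool_option true "cache_idle_slots:off"` prints `false`, while an unrecognized value such as `cache_idle_slots:maybe` prints the current value unchanged.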

@@ -1,7 +1,7 @@
# Pinned to the HEAD of feature/turboquant-kv-cache on https://github.com/TheTom/llama-cpp-turboquant.
# Auto-bumped nightly by .github/workflows/bump_deps.yaml.
TURBOQUANT_VERSION?=4c1c3ac09d2dba0aa9a55b94f6c50c41a92f9c8c
TURBOQUANT_VERSION?=5aeb2fdbe26cd4c534c6fa15de73cb5749bd0403
LLAMA_REPO?=https://github.com/TheTom/llama-cpp-turboquant
CMAKE_ARGS?=

@@ -1,23 +1,30 @@
#!/bin/bash
# Patch the shared backend/cpp/llama-cpp/grpc-server.cpp *copy* used by the
# turboquant build:
# turboquant build to account for the gaps between upstream and the fork:
#
# 1. Augment the kv_cache_types[] allow-list so `LoadModel` accepts the
# fork-specific `turbo2` / `turbo3` / `turbo4` cache types.
#
# Historical context: this script used to also paper over API gaps between the
# fork and upstream (flat vs nested `common_params_speculative`, missing
# `get_media_marker()`, `ctx_server.impl->model` vs `model_tgt`, and a
# LOCALAI_LEGACY_LLAMA_CPP_SPEC compile gate). As of TURBOQUANT_VERSION
# 4c1c3ac0 the fork has rebased past ggml-org/llama.cpp#21962, #22397 and
# #22838, so the shared grpc-server.cpp compiles unmodified against the fork.
# Only the fork-specific KV-cache enum entries remain.
# 2. Replace `get_media_marker()` (added upstream in ggml-org/llama.cpp#21962,
# server-side random per-instance marker) with the legacy "<__media__>"
# literal. The fork branched before that PR, so server-common.cpp has no
# get_media_marker symbol. The fork's mtmd_default_marker() still returns
# "<__media__>", and Go-side tooling falls back to that sentinel when the
# backend does not expose media_marker, so substituting the literal keeps
# behavior identical on the turboquant path.
# 3. Revert the `common_params_speculative` field references to the
# pre-refactor flat layout. Upstream ggml-org/llama.cpp#22397 split the
# struct into nested `draft` / `ngram_simple` / `ngram_mod` / etc. members;
# the turboquant fork branched before that PR and still exposes the flat
# `n_max`, `mparams_dft`, `ngram_size_n`, ... fields. The substitutions
# below map the new nested paths back to the legacy flat names so the
# shared grpc-server.cpp keeps compiling against the fork's common.h.
# Drop this block once the fork rebases past #22397.
#
# We patch the *copy* sitting in turboquant-<flavor>-build/, never the original
# under backend/cpp/llama-cpp/, so the stock llama-cpp build stays compiling
# under backend/cpp/llama-cpp/, so the stock llama-cpp build keeps compiling
# against vanilla upstream.
#
# Idempotent: skips the insertion if its marker is already present (so re-runs
# Idempotent: skips each insertion if its marker is already present (so re-runs
# of the same build dir don't double-insert).
set -euo pipefail
@@ -45,7 +52,7 @@ else
awk '
/^ GGML_TYPE_Q5_1,$/ && !done {
print
print " // turboquant fork extras - added by patch-grpc-server.sh"
print " // turboquant fork extras added by patch-grpc-server.sh"
print " GGML_TYPE_TURBO2_0,"
print " GGML_TYPE_TURBO3_0,"
print " GGML_TYPE_TURBO4_0,"
@@ -65,4 +72,83 @@ else
echo "==> KV allow-list patch OK"
fi
if grep -q 'get_media_marker()' "$SRC"; then
echo "==> patching $SRC to replace get_media_marker() with legacy \"<__media__>\" literal"
# Only one call site today (ModelMetadata), but replace all occurrences to
# stay robust if upstream adds more. Use a temp file to avoid relying on
# sed -i portability (the builder image uses GNU sed, but keeping this
# consistent with the awk block above).
sed 's/get_media_marker()/"<__media__>"/g' "$SRC" > "$SRC.tmp"
mv "$SRC.tmp" "$SRC"
echo "==> get_media_marker() substitution OK"
else
echo "==> $SRC has no get_media_marker() call, skipping media-marker patch"
fi
if grep -q 'params\.speculative\.draft\.\|params\.speculative\.ngram_simple\.' "$SRC"; then
echo "==> patching $SRC to revert common_params_speculative refs to pre-#22397 flat layout"
# Each substitution is the exact post-refactor path → legacy flat field.
# Order doesn't matter because the source paths are disjoint, but we keep
# the most-specific (mparams.path) first for readability.
sed -E \
-e 's/params\.speculative\.draft\.mparams\.path/params.speculative.mparams_dft.path/g' \
-e 's/params\.speculative\.draft\.n_max/params.speculative.n_max/g' \
-e 's/params\.speculative\.draft\.n_min/params.speculative.n_min/g' \
-e 's/params\.speculative\.draft\.p_min/params.speculative.p_min/g' \
-e 's/params\.speculative\.draft\.p_split/params.speculative.p_split/g' \
-e 's/params\.speculative\.draft\.n_gpu_layers/params.speculative.n_gpu_layers/g' \
-e 's/params\.speculative\.draft\.n_ctx/params.speculative.n_ctx/g' \
-e 's/params\.speculative\.ngram_simple\.size_n/params.speculative.ngram_size_n/g' \
-e 's/params\.speculative\.ngram_simple\.size_m/params.speculative.ngram_size_m/g' \
-e 's/params\.speculative\.ngram_simple\.min_hits/params.speculative.ngram_min_hits/g' \
"$SRC" > "$SRC.tmp"
mv "$SRC.tmp" "$SRC"
echo "==> speculative field rename OK"
else
echo "==> $SRC has no post-#22397 speculative field refs, skipping spec rename patch"
fi
# 4. Revert the `ctx_server.impl->model_tgt` rename introduced by upstream
# ggml-org/llama.cpp#22838 (parallel drafting). The turboquant fork still
# exposes the field as `model` on `server_context_impl`. The two call sites
# are in the Rerank and ModelMetadata RPC handlers.
if grep -q 'ctx_server\.impl->model_tgt' "$SRC"; then
echo "==> patching $SRC to revert ctx_server.impl->model_tgt -> ctx_server.impl->model"
sed -E 's/ctx_server\.impl->model_tgt/ctx_server.impl->model/g' "$SRC" > "$SRC.tmp"
mv "$SRC.tmp" "$SRC"
echo "==> model_tgt rename OK"
else
echo "==> $SRC has no ctx_server.impl->model_tgt refs, skipping model_tgt rename patch"
fi
# 5. Define LOCALAI_LEGACY_LLAMA_CPP_SPEC at the top of the file so the
# grpc-server option parser skips the new option-handler blocks (ngram_mod,
# ngram_map_k, ngram_map_k4v, ngram_cache, draft.cache_type_*, draft.cpuparams*,
# draft.tensor_buft_overrides) introduced for the post-#22838 layout. Those
# blocks reference struct fields that simply do not exist in the fork.
if grep -q '^#define LOCALAI_LEGACY_LLAMA_CPP_SPEC' "$SRC"; then
echo "==> $SRC already defines LOCALAI_LEGACY_LLAMA_CPP_SPEC, skipping"
else
echo "==> patching $SRC to define LOCALAI_LEGACY_LLAMA_CPP_SPEC at the top"
# Insert the define before the very first `#include` so it precedes all the
# speculative-decoding code paths.
awk '
!done && /^#include/ {
print "#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1"
print "// ^ injected by backend/cpp/turboquant/patch-grpc-server.sh"
print ""
done = 1
}
{ print }
END {
if (!done) {
print "patch-grpc-server.sh: no #include anchor found to insert LOCALAI_LEGACY_LLAMA_CPP_SPEC" > "/dev/stderr"
exit 1
}
}
' "$SRC" > "$SRC.tmp"
mv "$SRC.tmp" "$SRC"
echo "==> LOCALAI_LEGACY_LLAMA_CPP_SPEC define OK"
fi
echo "==> all patches applied"
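The substitution patches in this script share one pattern: check for the marker with `grep -q`, rewrite into a temp file with `sed`, then `mv` the result over the original, so a second run on an already patched file is a no-op. A minimal standalone sketch of that pattern, assuming a hypothetical function name `patch_media_marker` and a scratch file in place of the real `$SRC`:

```sh
# Hypothetical wrapper around the get_media_marker() substitution.
# Prints "patched" when it rewrote the file and "skipped" otherwise.
patch_media_marker() {
    src=$1
    if grep -q 'get_media_marker()' "$src"; then
        # Write to a temp file and move it into place instead of using sed -i.
        sed 's/get_media_marker()/"<__media__>"/g' "$src" > "$src.tmp"
        mv "$src.tmp" "$src"
        echo "patched"
    else
        echo "skipped"
    fi
}
```

Running the function twice on the same file reports `patched` and then `skipped`, which is the idempotency property the script header describes.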

@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
# stablediffusion.cpp (ggml)
STABLEDIFFUSION_GGML_REPO?=https://github.com/leejet/stable-diffusion.cpp
STABLEDIFFUSION_GGML_VERSION?=5b0267e941cade15bd80089d89838795d9f4baa6
STABLEDIFFUSION_GGML_VERSION?=a397e03488cc27e1a42da646b82dfce9f50741c0
CMAKE_ARGS+=-DGGML_MAX_NAME=128

@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
# whisper.cpp version
WHISPER_REPO?=https://github.com/ggml-org/whisper.cpp
WHISPER_CPP_VERSION?=afa2ea544fb4b0448916b4a31ecd33c8685bd482
WHISPER_CPP_VERSION?=0ccd896f5b882628e1c077f9769735ef4ce52860
SO_TARGET?=libgowhisper.so
CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF

@@ -36,15 +36,11 @@ fi
# flash-attn-4 4.0 stable lands.
EXTRA_PIP_INSTALL_FLAGS+=" --prerelease=allow"
# JetPack 7 / L4T arm64 wheels are built for cp312 and shipped via
# pypi.jetson-ai-lab.io. Bump the venv Python so the prebuilt sglang
# wheel resolves cleanly. The actual install on l4t13 goes through
# pyproject.toml (see the elif branch below) so [tool.uv.sources] can
# pin only torch/torchvision/torchaudio/sglang to the jetson-ai-lab
# index — leaving PyPI as the path for transitive deps like
# markdown-it-py / anthropic / propcache that the L4T mirror's proxy
# 503s on. No --index-strategy flag here: the explicit index keeps the
# scoping clean.
# JetPack 7 / L4T arm64 sglang + torch wheels come straight from PyPI now
# (torch 2.11+ ships aarch64 + cu130 manylinux wheels and sglang 0.5.11+
# ships a cp312 aarch64 wheel pinned to that torch). They're cp312-only,
# so bump the venv Python accordingly.
# https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/
if [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
PYTHON_VERSION="3.12"
PYTHON_PATCH="12"
@@ -110,27 +106,6 @@ if [ "x${BUILD_TYPE}" == "x" ] || [ "x${FROM_SOURCE:-}" == "xtrue" ]; then
fi
uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} .
popd
# L4T arm64 (JetPack 7): drive the install through pyproject.toml so that
# [tool.uv.sources] can pin torch/torchvision/torchaudio/sglang to the
# jetson-ai-lab index, while everything else (transitive deps and
# PyPI-resolvable packages like transformers / accelerate) comes from
# PyPI. Bypasses installRequirements because uv pip install -r
# requirements.txt does not honor sources — see
# backend/python/sglang/pyproject.toml for the rationale. Mirrors the
# equivalent path in backend/python/vllm/install.sh.
elif [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
ensureVenv
if [ "x${PORTABLE_PYTHON}" == "xtrue" ]; then
export C_INCLUDE_PATH="${C_INCLUDE_PATH:-}:$(_portable_dir)/include/python${PYTHON_VERSION}"
fi
pushd "${backend_dir}"
# Build deps first (matches installRequirements' requirements-install.txt
# pass — sglang/sgl-kernel sdists need packaging/setuptools-scm in the
# venv before they can build under --no-build-isolation).
uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} -r requirements-install.txt
uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} --requirement pyproject.toml
popd
runProtogen
else
installRequirements
fi

@@ -1,68 +0,0 @@
# L4T arm64 (JetPack 7 / sbsa cu130) install spec for the sglang backend.
#
# Why this file exists, and why only the l4t13 BUILD_PROFILE consumes it:
#
# pypi.jetson-ai-lab.io hosts the L4T-specific torch / sglang / sgl-kernel
# wheels we need on aarch64 + cuda13, but it ALSO transparently proxies the
# rest of PyPI through `/+f/<sha>/<filename>` URLs that 503 frequently.
# With `--extra-index-url` + `--index-strategy=unsafe-best-match` (the
# historical fix in install.sh) uv would pick those proxy URLs for ordinary
# PyPI packages — markdown-it-py, anthropic, propcache, etc. — and trip on
# the 503s. See e.g. CI run 25439791228 (markdown-it-py-4.0.0).
#
# `explicit = true` on the index makes uv consult the L4T mirror ONLY for
# packages mapped under [tool.uv.sources]. Everything else goes to PyPI.
# This breaks the historical 503 path without losing access to the L4T
# wheels we actually need from there. Mirrors the equivalent fix already
# in backend/python/vllm/pyproject.toml.
#
# `uv pip install -r requirements.txt` does NOT honor [tool.uv.sources]
# (sources are project-mode only, not pip-compat mode), so install.sh's
# l4t13 branch invokes `uv pip install --requirement pyproject.toml`
# directly. Other BUILD_PROFILEs continue to use the requirements-*.txt
# pipeline through libbackend.sh's installRequirements and never read
# this file.
[project]
name = "localai-sglang-l4t13"
version = "0.0.0"
requires-python = ">=3.12,<3.13"
dependencies = [
# Mirror of requirements.txt — kept in sync manually for now since the
# l4t13 path bypasses installRequirements (see install.sh).
"grpcio==1.80.0",
"protobuf",
"certifi",
"setuptools",
"pillow",
# L4T-specific accelerator stack (sourced from jetson-ai-lab below).
"torch",
"torchvision",
"torchaudio",
# sglang on jetson — the [all] extra is deliberately omitted because it
# pulls outlines/decord, and decord has no aarch64 cp312 wheel anywhere
# (neither PyPI nor the jetson-ai-lab index ships one, only legacy cp35-cp37). With
# [all] uv backtracks through versions trying to satisfy decord and
# lands on sglang==0.1.16. The 0.5.0 floor matches the only major
# series the jetson-ai-lab sbsa/cu130 mirror currently publishes
# (sglang==0.5.1.post2 as of 2026-05-06). Bumping to >=0.5.11 here
# would make the build unsatisfiable until the mirror catches up.
# Gemma 4 / MTP recipes are therefore not supported on l4t13 — those
# features land on cublas12/cublas13 hosts that pull the newer wheel
# from PyPI. backend.py keeps backward compat with the 0.5.x SamplingParams
# field rename via runtime detection.
"sglang>=0.5.0",
# PyPI-resolvable packages that complete the runtime.
"accelerate",
"transformers",
]
[[tool.uv.index]]
name = "jetson-ai-lab"
url = "https://pypi.jetson-ai-lab.io/sbsa/cu130"
explicit = true
[tool.uv.sources]
torch = { index = "jetson-ai-lab" }
torchvision = { index = "jetson-ai-lab" }
torchaudio = { index = "jetson-ai-lab" }
sglang = { index = "jetson-ai-lab" }

@@ -0,0 +1,15 @@
# sglang 0.5.11+ ships an aarch64 manylinux wheel on PyPI whose Requires-Dist
# pins torch==2.11.0 / torchaudio==2.11.0, locking an ABI-consistent set with
# the cu130 torch wheel installed above. 0.5.11 is the floor for Gemma 4
# support (sgl-project/sglang#21952).
#
# The [all] extra is deliberately NOT used on aarch64: it pulls the
# [diffusion] sub-extra which requires `xatlas`, and xatlas ships no
# aarch64 wheel and its sdist depends on scikit_build_core without
# declaring it in build-system.requires — so under --no-build-isolation
# uv can't build it. Upstream sglang gates st_attn and vsa on
# platform_machine != aarch64 in the diffusion extra but forgot xatlas.
# Plain `sglang` carries everything backend.py uses (Engine, ServerArgs,
# FunctionCallParser, ReasoningParser); the [all] extras are optional
# accelerators not required at import time.
sglang>=0.5.11

@@ -0,0 +1,9 @@
# JetPack 7 / L4T arm64 + CUDA 13. Since PyTorch 2.11 (April 2026), PyPI ships
# aarch64 + cu130 manylinux wheels for torch/torchvision/torchaudio directly,
# so we no longer need a custom --extra-index-url for the L4T mirror.
# https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/
accelerate
torch
torchvision
torchaudio
transformers

@@ -13,14 +13,14 @@ else
fi
# Handle l4t build profiles (Python 3.12, pip fallback) if needed.
# unsafe-best-match is required on l4t13 because the jetson-ai-lab index
# lists transitive deps at limited versions — without it uv pins to the
# first matching index and fails to resolve a compatible wheel from PyPI.
# Since PyTorch 2.11 (April 2026) PyPI ships aarch64 + cu130 manylinux wheels
# directly for torch/torchvision/torchaudio and an aarch64 vllm wheel pinned
# to that torch, so the jetson-ai-lab mirror is no longer needed.
# https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/
if [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
PYTHON_VERSION="3.12"
PYTHON_PATCH="12"
PY_STANDALONE_TAG="20251120"
EXTRA_PIP_INSTALL_FLAGS="${EXTRA_PIP_INSTALL_FLAGS:-} --index-strategy=unsafe-best-match"
fi
if [ "x${BUILD_PROFILE}" == "xl4t12" ]; then
@@ -42,18 +42,11 @@ if [ "x${BUILD_TYPE}" == "xhipblas" ]; then
else
uv pip install vllm==0.14.0 --extra-index-url https://wheels.vllm.ai/rocm/0.14.0/rocm700
fi
elif [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
# JetPack 7 / L4T arm64 cu130 — vllm comes from the prebuilt SBSA wheel
# at jetson-ai-lab. Version is unpinned: the index ships whatever build
# matches the cu130/cp312 ABI. unsafe-best-match lets uv fall through
# to PyPI for transitive deps not present on the jetson-ai-lab index.
if [ "x${USE_PIP}" == "xtrue" ]; then
pip install vllm --extra-index-url https://pypi.jetson-ai-lab.io/sbsa/cu130
else
uv pip install --index-strategy=unsafe-best-match vllm --extra-index-url https://pypi.jetson-ai-lab.io/sbsa/cu130
fi
elif [ "x${BUILD_PROFILE}" == "xcublas13" ]; then
# vllm 0.19+ defaults to cu130 wheels on PyPI, no extra index needed.
elif [ "x${BUILD_PROFILE}" == "xcublas13" ] || [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
# cublas13 (x86_64) and l4t13 (aarch64) both pull vllm from PyPI now:
# vllm 0.19+ defaults to cu130 wheels on x86_64 and vllm 0.20+ ships an
# aarch64 manylinux wheel pinned to torch==2.11.0. No extra index needed
# in either case.
if [ "x${USE_PIP}" == "xtrue" ]; then
pip install vllm --torch-backend=auto
else

@@ -1,11 +1,15 @@
--extra-index-url https://pypi.jetson-ai-lab.io/sbsa/cu130
# JetPack 7 / L4T arm64 + CUDA 13. PyPI ships aarch64 + cu130 manylinux wheels
# for torch/torchvision/torchaudio directly since PyTorch 2.11 (April 2026),
# so no custom index is needed. flash-attn is dropped here: PyPI has no
# aarch64 wheel for it, but vLLM 0.20+ bundles its own vllm_flash_attn
# (fa2 + fa3) inside the main wheel, so it is not required at runtime.
# https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/
accelerate
torch
torchvision
torchaudio
transformers
bitsandbytes
flash-attn
diffusers
librosa
soundfile

@@ -43,14 +43,11 @@ if [ "x${BUILD_PROFILE}" == "xcublas13" ]; then
EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi
# JetPack 7 / L4T arm64 wheels (torch, vllm, flash-attn) live on
# pypi.jetson-ai-lab.io and are built for cp312, so bump the venv Python
# accordingly. JetPack 6 keeps cp310 + USE_PIP=true.
#
# l4t13 uses pyproject.toml (see the elif branch below) to pin only the
# L4T-specific wheels to the jetson-ai-lab index via [tool.uv.sources].
# That keeps PyPI as the resolution path for transitive deps like
# anthropic/openai/propcache, which the L4T mirror's proxy 503s on.
# JetPack 7 / L4T arm64 vllm + torch wheels come straight from PyPI now
# (torch 2.11+ ships aarch64 + cu130 manylinux wheels and vllm 0.20+ ships
# an aarch64 wheel pinned to that torch). They're cp312-only, so bump the
# venv Python accordingly. JetPack 6 keeps cp310 + USE_PIP=true.
# https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/
if [ "x${BUILD_PROFILE}" == "xl4t12" ]; then
USE_PIP=true
fi
@@ -103,25 +100,6 @@ if [ "x${BUILD_TYPE}" == "xintel" ]; then
export CMAKE_PREFIX_PATH="$(python -c 'import site; print(site.getsitepackages()[0])'):${CMAKE_PREFIX_PATH:-}"
VLLM_TARGET_DEVICE=xpu uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} --no-deps .
popd
# L4T arm64 (JetPack 7): drive the install through pyproject.toml so that
# [tool.uv.sources] can pin torch/vllm/flash-attn/torchvision/torchaudio
# to the jetson-ai-lab index, while everything else (transitive deps and
# PyPI-resolvable packages like transformers) comes from PyPI. Bypasses
# installRequirements because uv pip install -r requirements.txt does not
# honor sources — see backend/python/vllm/pyproject.toml for the rationale.
elif [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
ensureVenv
if [ "x${PORTABLE_PYTHON}" == "xtrue" ]; then
export C_INCLUDE_PATH="${C_INCLUDE_PATH:-}:$(_portable_dir)/include/python${PYTHON_VERSION}"
fi
pushd "${backend_dir}"
# Build deps first (matches installRequirements' requirements-install.txt
# pass — fastsafetensors and friends need pybind11 in the venv before
# their sdists can build under --no-build-isolation).
uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} -r requirements-install.txt
uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} --requirement pyproject.toml
popd
runProtogen
# FROM_SOURCE=true on a CPU build skips the prebuilt vllm wheel in
# requirements-cpu-after.txt and compiles vllm locally against the host's
# actual CPU. Not used by default because it takes ~30-40 minutes, but

@@ -1,61 +0,0 @@
# L4T arm64 (JetPack 7 / sbsa cu130) install spec for the vllm backend.
#
# Why this file exists, and why only the l4t13 BUILD_PROFILE consumes it:
#
# pypi.jetson-ai-lab.io hosts the L4T-specific torch / vllm / flash-attn
# wheels we need on aarch64 + cuda13, but it ALSO transparently proxies the
# rest of PyPI through `/+f/<sha>/<filename>` URLs that 503 frequently. With
# `--extra-index-url` + `--index-strategy=unsafe-best-match` (the historical
# fix in install.sh) uv would pick those proxy URLs for ordinary PyPI
# packages — `anthropic`, `openai`, `propcache`, `annotated-types` — and
# trip on the 503s. See e.g. CI run 25212201349 (anthropic-0.97.0).
#
# `explicit = true` on the index makes uv consult the L4T mirror ONLY for
# packages mapped under [tool.uv.sources]. Everything else goes to PyPI.
# This breaks the historical 503 path without losing access to the L4T
# wheels we actually need from there.
#
# `uv pip install -r requirements.txt` does NOT honor [tool.uv.sources]
# (sources are project-mode only, not pip-compat mode), so install.sh's
# l4t13 branch invokes `uv pip install --requirement pyproject.toml`
# directly. Other BUILD_PROFILEs continue to use the requirements-*.txt
# pipeline through libbackend.sh's installRequirements and never read
# this file.
[project]
name = "localai-vllm-l4t13"
version = "0.0.0"
requires-python = ">=3.12,<3.13"
dependencies = [
# Mirror of requirements.txt — kept in sync manually for now since the
# l4t13 path bypasses installRequirements (see install.sh).
"grpcio==1.80.0",
"protobuf",
"certifi",
"setuptools",
"pillow",
"charset-normalizer>=3.4.7",
"chardet",
# L4T-specific accelerator stack (sourced from jetson-ai-lab below).
"torch",
"torchvision",
"torchaudio",
"flash-attn",
"vllm",
# PyPI-resolvable packages that complete the runtime — accelerate,
# transformers, bitsandbytes carry their own wheels for aarch64.
"accelerate",
"transformers",
"bitsandbytes",
]
[[tool.uv.index]]
name = "jetson-ai-lab"
url = "https://pypi.jetson-ai-lab.io/sbsa/cu130"
explicit = true
[tool.uv.sources]
torch = { index = "jetson-ai-lab" }
torchvision = { index = "jetson-ai-lab" }
torchaudio = { index = "jetson-ai-lab" }
flash-attn = { index = "jetson-ai-lab" }
vllm = { index = "jetson-ai-lab" }

@@ -0,0 +1,4 @@
# vLLM 0.20+ ships an aarch64 manylinux wheel on PyPI whose Requires-Dist pins
# torch==2.11.0 / torchvision==0.26.0 / torchaudio==2.11.0, locking an ABI-
# consistent set with the cu130 torch wheel installed above.
vllm

@@ -0,0 +1,8 @@
# JetPack 7 / L4T arm64 + CUDA 13. Since PyTorch 2.11 (April 2026), PyPI ships
# aarch64 + cu130 manylinux wheels for torch/torchvision/torchaudio directly,
# so we no longer need a custom --extra-index-url for the L4T mirror.
# https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/
accelerate
torch
transformers
bitsandbytes

@@ -233,7 +233,12 @@ func initDistributed(cfg *config.ApplicationConfig, authDB *gorm.DB, configLoade
xlog.Info("File stager initialized (HTTP direct transfer)")
}
// Create RemoteUnloaderAdapter — needed by SmartRouter and startup.go
remoteUnloader := nodes.NewRemoteUnloaderAdapter(registry, natsClient)
remoteUnloader := nodes.NewRemoteUnloaderAdapter(
registry,
natsClient,
cfg.Distributed.BackendInstallTimeoutOrDefault(),
cfg.Distributed.BackendUpgradeTimeoutOrDefault(),
)
// All dependencies ready — build SmartRouter with all options at once
var conflictResolver nodes.ConcurrencyConflictResolver

@@ -17,9 +17,9 @@ import (
"github.com/mudler/LocalAI/core/services/jobs"
"github.com/mudler/LocalAI/core/services/nodes"
"github.com/mudler/LocalAI/core/services/storage"
"github.com/mudler/LocalAI/pkg/vram"
coreStartup "github.com/mudler/LocalAI/core/startup"
"github.com/mudler/LocalAI/internal"
"github.com/mudler/LocalAI/pkg/vram"
"github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/LocalAI/pkg/sanitize"
@@ -200,7 +200,7 @@ func New(opts ...config.AppOption) (*Application, error) {
nodes.NewDistributedModelManager(options, application.modelLoader, distSvc.Unloader),
)
application.galleryService.SetBackendManager(
nodes.NewDistributedBackendManager(options, application.modelLoader, distSvc.Unloader, distSvc.Registry),
nodes.NewDistributedBackendManager(options, application.modelLoader, distSvc.Unloader, distSvc.Registry, application.galleryService),
)
}
}
@@ -552,6 +552,13 @@ func loadRuntimeSettingsFromFile(options *config.ApplicationConfig) {
options.TracingMaxItems = *settings.TracingMaxItems
}
}
if settings.TracingMaxBodyBytes != nil {
// Allow the on-disk setting to override the CLI/env default. The
// startup default is non-zero (see NewApplicationConfig), so a plain
// `== 0` guard like the others would never trigger; we instead respect
// any value the file specifies. 0 in the file means "uncapped".
options.TracingMaxBodyBytes = *settings.TracingMaxBodyBytes
}
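The override rule for `TracingMaxBodyBytes` (any value present in the settings file wins, and 0 means "uncapped") can be sketched in shell as follows. This is an illustration under stated assumptions, not the Go implementation: `resolve_max_body_bytes` and `cap_bytes` are hypothetical names, an empty string stands in for a nil pointer, and the byte cap does not reproduce `trace.TruncateToBytes`, whose exact behavior (for example at multi-byte character boundaries) is not shown in this diff.

```sh
# Hypothetical stand-in for the override rule: an empty second argument
# plays the role of a nil pointer (setting absent from the file).
resolve_max_body_bytes() {
    default=$1
    from_file=$2
    if [ -n "$from_file" ]; then
        # Any value in the file wins, including 0.
        echo "$from_file"
    else
        echo "$default"
    fi
}

# Hypothetical byte cap: 0 means "uncapped", otherwise keep the first N bytes.
cap_bytes() {
    if [ "$2" -eq 0 ]; then
        printf '%s' "$1"
    else
        printf '%s' "$1" | head -c "$2"
    fi
}
```

The point of the sketch is the distinction between "absent" and "zero": a guard that only tested for zero could not express an explicit request for uncapped bodies.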
// Branding / whitelabeling. There are no env vars for these — the file is
// the only source — so apply unconditionally. Without this block a server

@@ -78,7 +78,7 @@ func ModelAudioTransform(
var startTime time.Time
if appConfig.EnableTracing {
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems)
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems, appConfig.TracingMaxBodyBytes)
startTime = time.Now()
}
@@ -104,7 +104,7 @@ func ModelAudioTransform(
data["sample_rate"] = res.SampleRate
data["samples"] = res.Samples
data["reference_provided"] = res.ReferenceProvided
if snippet := trace.AudioSnippet(dst); snippet != nil {
if snippet := trace.AudioSnippet(dst, appConfig.TracingMaxBodyBytes); snippet != nil {
maps.Copy(data, snippet)
}
}

@@ -35,7 +35,7 @@ func Detection(
var startTime time.Time
if appConfig.EnableTracing {
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems)
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems, appConfig.TracingMaxBodyBytes)
startTime = time.Now()
}

@@ -67,7 +67,7 @@ func ModelEmbedding(s string, tokens []int, loader *model.ModelLoader, modelConf
}
if appConfig.EnableTracing {
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems)
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems, appConfig.TracingMaxBodyBytes)
traceData := map[string]any{
"input_text": trace.TruncateString(s, 1000),

@@ -32,7 +32,7 @@ func FaceAnalyze(
var startTime time.Time
if appConfig.EnableTracing {
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems)
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems, appConfig.TracingMaxBodyBytes)
startTime = time.Now()
}

@@ -32,7 +32,7 @@ func FaceVerify(
var startTime time.Time
if appConfig.EnableTracing {
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems)
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems, appConfig.TracingMaxBodyBytes)
startTime = time.Now()
}

@@ -41,7 +41,7 @@ func ImageGeneration(height, width, step, seed int, positive_prompt, negative_pr
}
if appConfig.EnableTracing {
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems)
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems, appConfig.TracingMaxBodyBytes)
traceData := map[string]any{
"positive_prompt": positive_prompt,

View File

@@ -305,7 +305,7 @@ func ModelInference(ctx context.Context, s string, messages schema.Messages, ima
}
if o.EnableTracing {
trace.InitBackendTracingIfEnabled(o.TracingMaxItems)
trace.InitBackendTracingIfEnabled(o.TracingMaxItems, o.TracingMaxBodyBytes)
traceData := map[string]any{
"chat_template": c.TemplateConfig.Chat,
@@ -316,9 +316,13 @@ func ModelInference(ctx context.Context, s string, messages schema.Messages, ima
"audios_count": len(audios),
}
// Cap the captured fields up front: agent-pool LLM calls embed the
// full augmented chat history in messages and the full reply in
// response, so without a per-field cap a single trace can dwarf the
// rest of the buffer. The cap matches the API-trace body cap.
if len(messages) > 0 {
if msgJSON, err := json.Marshal(messages); err == nil {
traceData["messages"] = string(msgJSON)
traceData["messages"] = trace.TruncateToBytes(string(msgJSON), o.TracingMaxBodyBytes)
}
}
if reasoningJSON, err := json.Marshal(c.ReasoningConfig); err == nil {
@@ -337,7 +341,7 @@ func ModelInference(ctx context.Context, s string, messages schema.Messages, ima
resp, err := originalFn()
duration := time.Since(startTime)
traceData["response"] = resp.Response
traceData["response"] = trace.TruncateToBytes(resp.Response, o.TracingMaxBodyBytes)
traceData["token_usage"] = map[string]any{
"prompt": resp.Usage.Prompt,
"completion": resp.Usage.Completion,
@@ -359,10 +363,10 @@ func ModelInference(ctx context.Context, s string, messages schema.Messages, ima
toolCallCount += len(d.ToolCalls)
}
if len(contentParts) > 0 {
chatDeltasInfo["content"] = strings.Join(contentParts, "")
chatDeltasInfo["content"] = trace.TruncateToBytes(strings.Join(contentParts, ""), o.TracingMaxBodyBytes)
}
if len(reasoningParts) > 0 {
chatDeltasInfo["reasoning_content"] = strings.Join(reasoningParts, "")
chatDeltasInfo["reasoning_content"] = trace.TruncateToBytes(strings.Join(reasoningParts, ""), o.TracingMaxBodyBytes)
}
if toolCallCount > 0 {
chatDeltasInfo["tool_call_count"] = toolCallCount
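
The hunks above route every large trace field through trace.TruncateToBytes with o.TracingMaxBodyBytes as the cap. The body of that helper is not part of this diff, so the following is only a minimal sketch of what a byte cap of this kind could look like. It assumes that a cap of 0 (or less) means "uncapped", as the flag help text states, and that the cut should not split a multi-byte UTF-8 character. The name truncateToBytes is hypothetical, and the real helper may differ (for example by appending a truncation marker).

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateToBytes is a hypothetical stand-in for trace.TruncateToBytes.
// Assumption: maxBytes <= 0 disables the cap, mirroring "0 = uncapped".
func truncateToBytes(s string, maxBytes int) string {
	if maxBytes <= 0 || len(s) <= maxBytes {
		return s
	}
	cut := maxBytes
	// Back up until cut points at the first byte of a rune, so the
	// returned prefix never ends in the middle of a UTF-8 sequence.
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut]
}

func main() {
	fmt.Println(truncateToBytes("héllo", 2)) // cap falls inside "é", so only "h" is kept
	fmt.Println(truncateToBytes("héllo", 3))
	fmt.Println(truncateToBytes("héllo", 0)) // uncapped
}
```

Capping by bytes rather than by runes keeps the memory bound exact, at the cost of occasionally returning slightly fewer bytes than the cap allows.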

View File

@@ -21,7 +21,7 @@ func recordModelLoadFailure(appConfig *config.ApplicationConfig, modelName, back
if !appConfig.EnableTracing {
return
}
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems)
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems, appConfig.TracingMaxBodyBytes)
trace.RecordBackendTrace(trace.BackendTrace{
Timestamp: time.Now(),
Type: trace.BackendTraceModelLoad,
@@ -277,7 +277,7 @@ func gRPCPredictOpts(c config.ModelConfig, modelPath string) *pb.PredictOptions
MinP: float32(*c.MinP),
Tokens: int32(*c.Maxtokens),
Threads: int32(*c.Threads),
PromptCacheAll: c.PromptCacheAll,
PromptCacheAll: *c.PromptCacheAll,
PromptCacheRO: c.PromptCacheRO,
PromptCachePath: promptCachePath,
F16KV: *c.F16,

View File

@@ -25,7 +25,7 @@ func Rerank(ctx context.Context, request *proto.RerankRequest, loader *model.Mod
var startTime time.Time
if appConfig.EnableTracing {
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems)
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems, appConfig.TracingMaxBodyBytes)
startTime = time.Now()
}

View File

@@ -98,7 +98,7 @@ func SoundGeneration(
var startTime time.Time
if appConfig.EnableTracing {
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems)
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems, appConfig.TracingMaxBodyBytes)
startTime = time.Now()
}

View File

@@ -27,7 +27,7 @@ func ModelTokenize(s string, loader *model.ModelLoader, modelConfig config.Model
var startTime time.Time
if appConfig.EnableTracing {
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems)
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems, appConfig.TracingMaxBodyBytes)
startTime = time.Now()
}

View File

@@ -76,10 +76,10 @@ func ModelTranscriptionWithOptions(ctx context.Context, req TranscriptionRequest
var startTime time.Time
var audioSnippet map[string]any
if appConfig.EnableTracing {
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems)
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems, appConfig.TracingMaxBodyBytes)
startTime = time.Now()
// Capture audio before the backend call — the backend may delete the file.
audioSnippet = trace.AudioSnippet(req.Audio)
audioSnippet = trace.AudioSnippet(req.Audio, appConfig.TracingMaxBodyBytes)
}
r, err := transcriptionModel.AudioTranscription(ctx, req.toProto(uint32(*modelConfig.Threads)))

View File

@@ -67,7 +67,7 @@ func ModelTTS(
var startTime time.Time
if appConfig.EnableTracing {
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems)
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems, appConfig.TracingMaxBodyBytes)
startTime = time.Now()
}
@@ -93,7 +93,7 @@ func ModelTTS(
"language": language,
}
if err == nil && res.Success {
if snippet := trace.AudioSnippet(filePath); snippet != nil {
if snippet := trace.AudioSnippet(filePath, appConfig.TracingMaxBodyBytes); snippet != nil {
maps.Copy(data, snippet)
}
}
@@ -161,7 +161,7 @@ func ModelTTSStream(
var startTime time.Time
if appConfig.EnableTracing {
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems)
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems, appConfig.TracingMaxBodyBytes)
startTime = time.Now()
}
@@ -260,7 +260,7 @@ func ModelTTSStream(
"streaming": true,
}
if resultErr == nil && len(snippetPCM) > 0 {
if snippet := trace.AudioSnippetFromPCM(snippetPCM, int(sampleRate), totalPCMBytes); snippet != nil {
if snippet := trace.AudioSnippetFromPCM(snippetPCM, int(sampleRate), totalPCMBytes, appConfig.TracingMaxBodyBytes); snippet != nil {
maps.Copy(data, snippet)
}
}

View File

@@ -42,7 +42,7 @@ func VideoGeneration(height, width int32, prompt, negativePrompt, startImage, en
}
if appConfig.EnableTracing {
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems)
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems, appConfig.TracingMaxBodyBytes)
traceData := map[string]any{
"prompt": prompt,

View File

@@ -31,7 +31,7 @@ func VoiceAnalyze(
var startTime time.Time
if appConfig.EnableTracing {
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems)
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems, appConfig.TracingMaxBodyBytes)
startTime = time.Now()
}

View File

@@ -34,7 +34,7 @@ func VoiceEmbed(
var startTime time.Time
if appConfig.EnableTracing {
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems)
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems, appConfig.TracingMaxBodyBytes)
startTime = time.Now()
}

View File

@@ -32,7 +32,7 @@ func VoiceVerify(
var startTime time.Time
if appConfig.EnableTracing {
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems)
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems, appConfig.TracingMaxBodyBytes)
startTime = time.Now()
}

View File

@@ -39,19 +39,19 @@ type RunCMD struct {
LocalaiConfigDir string `env:"LOCALAI_CONFIG_DIR" type:"path" default:"${basepath}/configuration" help:"Directory for dynamic loading of certain configuration files (currently api_keys.json and external_backends.json)" group:"storage"`
LocalaiConfigDirPollInterval time.Duration `env:"LOCALAI_CONFIG_DIR_POLL_INTERVAL" help:"Typically the config path picks up changes automatically, but if your system has broken fsnotify events, set this to an interval to poll the LocalAI Config Dir (example: 1m)" group:"storage"`
// The alias on this option is there to preserve functionality with the old `--config-file` parameter
ModelsConfigFile string `env:"LOCALAI_MODELS_CONFIG_FILE,CONFIG_FILE" aliases:"config-file" help:"YAML file containing a list of model backend configs" group:"storage"`
BackendGalleries string `env:"LOCALAI_BACKEND_GALLERIES,BACKEND_GALLERIES" help:"JSON list of backend galleries" group:"backends" default:"${backends}"`
Galleries string `env:"LOCALAI_GALLERIES,GALLERIES" help:"JSON list of galleries" group:"models" default:"${galleries}"`
AutoloadGalleries bool `env:"LOCALAI_AUTOLOAD_GALLERIES,AUTOLOAD_GALLERIES" group:"models" default:"true"`
AutoloadBackendGalleries bool `env:"LOCALAI_AUTOLOAD_BACKEND_GALLERIES,AUTOLOAD_BACKEND_GALLERIES" group:"backends" default:"true"`
BackendImagesReleaseTag string `env:"LOCALAI_BACKEND_IMAGES_RELEASE_TAG,BACKEND_IMAGES_RELEASE_TAG" help:"Fallback release tag for backend images" group:"backends" default:"latest"`
BackendImagesBranchTag string `env:"LOCALAI_BACKEND_IMAGES_BRANCH_TAG,BACKEND_IMAGES_BRANCH_TAG" help:"Fallback branch tag for backend images" group:"backends" default:"master"`
BackendDevSuffix string `env:"LOCALAI_BACKEND_DEV_SUFFIX,BACKEND_DEV_SUFFIX" help:"Development suffix for backend images" group:"backends" default:"development"`
ModelsConfigFile string `env:"LOCALAI_MODELS_CONFIG_FILE,CONFIG_FILE" aliases:"config-file" help:"YAML file containing a list of model backend configs" group:"storage"`
BackendGalleries string `env:"LOCALAI_BACKEND_GALLERIES,BACKEND_GALLERIES" help:"JSON list of backend galleries" group:"backends" default:"${backends}"`
Galleries string `env:"LOCALAI_GALLERIES,GALLERIES" help:"JSON list of galleries" group:"models" default:"${galleries}"`
AutoloadGalleries bool `env:"LOCALAI_AUTOLOAD_GALLERIES,AUTOLOAD_GALLERIES" group:"models" default:"true"`
AutoloadBackendGalleries bool `env:"LOCALAI_AUTOLOAD_BACKEND_GALLERIES,AUTOLOAD_BACKEND_GALLERIES" group:"backends" default:"true"`
BackendImagesReleaseTag string `env:"LOCALAI_BACKEND_IMAGES_RELEASE_TAG,BACKEND_IMAGES_RELEASE_TAG" help:"Fallback release tag for backend images" group:"backends" default:"latest"`
BackendImagesBranchTag string `env:"LOCALAI_BACKEND_IMAGES_BRANCH_TAG,BACKEND_IMAGES_BRANCH_TAG" help:"Fallback branch tag for backend images" group:"backends" default:"master"`
BackendDevSuffix string `env:"LOCALAI_BACKEND_DEV_SUFFIX,BACKEND_DEV_SUFFIX" help:"Development suffix for backend images" group:"backends" default:"development"`
AutoUpgradeBackends bool `env:"LOCALAI_AUTO_UPGRADE_BACKENDS,AUTO_UPGRADE_BACKENDS" help:"Automatically upgrade backends when new versions are detected" group:"backends" default:"false"`
PreferDevelopmentBackends bool `env:"LOCALAI_PREFER_DEV_BACKENDS,PREFER_DEV_BACKENDS" help:"Prefer development backend versions (shows development backends by default in UI)" group:"backends" default:"false"`
PreloadModels string `env:"LOCALAI_PRELOAD_MODELS,PRELOAD_MODELS" help:"A List of models to apply in JSON at start" group:"models"`
Models []string `env:"LOCALAI_MODELS,MODELS" help:"A List of model configuration URLs to load" group:"models"`
PreloadModelsConfig string `env:"LOCALAI_PRELOAD_MODELS_CONFIG,PRELOAD_MODELS_CONFIG" help:"A List of models to apply at startup. Path to a YAML config file" group:"models"`
Models []string `env:"LOCALAI_MODELS,MODELS" help:"A List of model configuration URLs to load" group:"models"`
PreloadModelsConfig string `env:"LOCALAI_PRELOAD_MODELS_CONFIG,PRELOAD_MODELS_CONFIG" help:"A List of models to apply at startup. Path to a YAML config file" group:"models"`
F16 bool `name:"f16" env:"LOCALAI_F16,F16" help:"Enable GPU acceleration" group:"performance"`
Threads int `env:"LOCALAI_THREADS,THREADS" short:"t" help:"Number of threads used for parallel computation. Usage of the number of physical cores in the system is suggested" group:"performance"`
@@ -100,6 +100,7 @@ type RunCMD struct {
LoadToMemory []string `env:"LOCALAI_LOAD_TO_MEMORY,LOAD_TO_MEMORY" help:"A list of models to load into memory at startup" group:"models"`
EnableTracing bool `env:"LOCALAI_ENABLE_TRACING,ENABLE_TRACING" help:"Enable API tracing" group:"api"`
TracingMaxItems int `env:"LOCALAI_TRACING_MAX_ITEMS" default:"1024" help:"Maximum number of traces to keep" group:"api"`
TracingMaxBodyBytes int `env:"LOCALAI_TRACING_MAX_BODY_BYTES" default:"65536" help:"Maximum bytes captured per request/response body in the trace buffer (0 = uncapped). Caps memory growth from chatty endpoints like /embeddings." group:"api"`
AgentJobRetentionDays int `env:"LOCALAI_AGENT_JOB_RETENTION_DAYS,AGENT_JOB_RETENTION_DAYS" default:"30" help:"Number of days to keep agent job history (default: 30)" group:"api"`
OpenResponsesStoreTTL string `env:"LOCALAI_OPEN_RESPONSES_STORE_TTL,OPEN_RESPONSES_STORE_TTL" default:"0" help:"TTL for Open Responses store (e.g., 1h, 30m, 0 = no expiration)" group:"api"`
@@ -144,16 +145,18 @@ type RunCMD struct {
DefaultAPIKeyExpiry string `env:"LOCALAI_DEFAULT_API_KEY_EXPIRY" help:"Default expiry for API keys (e.g. 90d, 1y; empty = no expiry)" group:"auth"`
// Distributed / Horizontal Scaling
Distributed bool `env:"LOCALAI_DISTRIBUTED" default:"false" help:"Enable distributed mode (requires PostgreSQL + NATS)" group:"distributed"`
InstanceID string `env:"LOCALAI_INSTANCE_ID" help:"Unique instance ID for distributed mode (auto-generated UUID if empty)" group:"distributed"`
NatsURL string `env:"LOCALAI_NATS_URL" help:"NATS server URL (e.g., nats://localhost:4222)" group:"distributed"`
StorageURL string `env:"LOCALAI_STORAGE_URL" help:"S3-compatible storage endpoint URL (e.g., http://minio:9000)" group:"distributed"`
StorageBucket string `env:"LOCALAI_STORAGE_BUCKET" default:"localai" help:"S3 bucket name for object storage" group:"distributed"`
StorageRegion string `env:"LOCALAI_STORAGE_REGION" default:"us-east-1" help:"S3 region" group:"distributed"`
StorageAccessKey string `env:"LOCALAI_STORAGE_ACCESS_KEY" help:"S3 access key ID" group:"distributed"`
StorageSecretKey string `env:"LOCALAI_STORAGE_SECRET_KEY" help:"S3 secret access key" group:"distributed"`
RegistrationToken string `env:"LOCALAI_REGISTRATION_TOKEN" help:"Token that backend nodes must provide to register (empty = no auth required)" group:"distributed"`
AutoApproveNodes bool `env:"LOCALAI_AUTO_APPROVE_NODES" default:"false" help:"Auto-approve new worker nodes (skip admin approval)" group:"distributed"`
Distributed bool `env:"LOCALAI_DISTRIBUTED" default:"false" help:"Enable distributed mode (requires PostgreSQL + NATS)" group:"distributed"`
InstanceID string `env:"LOCALAI_INSTANCE_ID" help:"Unique instance ID for distributed mode (auto-generated UUID if empty)" group:"distributed"`
NatsURL string `env:"LOCALAI_NATS_URL" help:"NATS server URL (e.g., nats://localhost:4222)" group:"distributed"`
StorageURL string `env:"LOCALAI_STORAGE_URL" help:"S3-compatible storage endpoint URL (e.g., http://minio:9000)" group:"distributed"`
StorageBucket string `env:"LOCALAI_STORAGE_BUCKET" default:"localai" help:"S3 bucket name for object storage" group:"distributed"`
StorageRegion string `env:"LOCALAI_STORAGE_REGION" default:"us-east-1" help:"S3 region" group:"distributed"`
StorageAccessKey string `env:"LOCALAI_STORAGE_ACCESS_KEY" help:"S3 access key ID" group:"distributed"`
StorageSecretKey string `env:"LOCALAI_STORAGE_SECRET_KEY" help:"S3 secret access key" group:"distributed"`
RegistrationToken string `env:"LOCALAI_REGISTRATION_TOKEN" help:"Token that backend nodes must provide to register (empty = no auth required)" group:"distributed"`
AutoApproveNodes bool `env:"LOCALAI_AUTO_APPROVE_NODES" default:"false" help:"Auto-approve new worker nodes (skip admin approval)" group:"distributed"`
BackendInstallTimeout string `env:"LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT" help:"NATS round-trip timeout for backend.install requests sent to worker nodes (default 15m). Increase for slow links pulling multi-GB images." group:"distributed"`
BackendUpgradeTimeout string `env:"LOCALAI_NATS_BACKEND_UPGRADE_TIMEOUT" help:"NATS round-trip timeout for backend.upgrade requests (default 15m)." group:"distributed"`
Version bool
}
@@ -254,6 +257,20 @@ func (r *RunCMD) Run(ctx *cliContext.Context) error {
if r.StorageSecretKey != "" {
opts = append(opts, config.WithStorageSecretKey(r.StorageSecretKey))
}
if r.BackendInstallTimeout != "" {
d, err := time.ParseDuration(r.BackendInstallTimeout)
if err != nil {
return fmt.Errorf("invalid LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT %q: %w", r.BackendInstallTimeout, err)
}
opts = append(opts, config.WithBackendInstallTimeout(d))
}
if r.BackendUpgradeTimeout != "" {
d, err := time.ParseDuration(r.BackendUpgradeTimeout)
if err != nil {
return fmt.Errorf("invalid LOCALAI_NATS_BACKEND_UPGRADE_TIMEOUT %q: %w", r.BackendUpgradeTimeout, err)
}
opts = append(opts, config.WithBackendUpgradeTimeout(d))
}
if r.RegistrationToken != "" {
opts = append(opts, config.WithRegistrationToken(r.RegistrationToken))
}
@@ -273,6 +290,7 @@ func (r *RunCMD) Run(ctx *cliContext.Context) error {
opts = append(opts, config.EnableTracing)
}
opts = append(opts, config.WithTracingMaxItems(r.TracingMaxItems))
opts = append(opts, config.WithTracingMaxBodyBytes(r.TracingMaxBodyBytes))
token := ""
if r.Peer2Peer || r.Peer2PeerToken != "" {
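
The Run hunk above treats the two NATS timeouts as optional strings: an empty value leaves the default in place, and anything else must parse with time.ParseDuration or startup fails with the variable name in the error. A minimal standalone sketch of that pattern follows. The helper name parseOptionalTimeout and its three-value return are hypothetical and only illustrate the control flow; the diff itself inlines this logic.

```go
package main

import (
	"fmt"
	"time"
)

// parseOptionalTimeout is a hypothetical helper (not in the diff).
// It returns set=false for an empty value so the caller can keep its default.
func parseOptionalTimeout(envName, raw string) (d time.Duration, set bool, err error) {
	if raw == "" {
		return 0, false, nil
	}
	d, err = time.ParseDuration(raw)
	if err != nil {
		// Wrap with %w so callers can still inspect the underlying error.
		return 0, false, fmt.Errorf("invalid %s %q: %w", envName, raw, err)
	}
	return d, true, nil
}

func main() {
	for _, raw := range []string{"", "15m", "1h30m", "soon"} {
		d, set, err := parseOptionalTimeout("LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT", raw)
		fmt.Printf("raw=%q set=%v d=%v err=%v\n", raw, set, d, err)
	}
}
```

Note that time.ParseDuration requires a unit suffix, so a bare number such as "900" is rejected rather than interpreted as seconds.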

View File

@@ -21,6 +21,7 @@ type ApplicationConfig struct {
Debug bool
EnableTracing bool
TracingMaxItems int
TracingMaxBodyBytes int // Per-body cap for captured request/response bodies; 0 disables the cap
EnableBackendLogging bool
GeneratedContentDir string
@@ -187,6 +188,7 @@ func NewApplicationConfig(o ...AppOption) *ApplicationConfig {
LRUEvictionRetryInterval: 1 * time.Second, // Default: 1 second
WatchDogInterval: 500 * time.Millisecond, // Default: 500ms
TracingMaxItems: 1024,
TracingMaxBodyBytes: 64 * 1024, // 64 KiB - caps each request/response body in the trace buffer
AgentPool: AgentPoolConfig{
Enabled: true,
Timeout: "5m",
@@ -578,6 +580,12 @@ func WithTracingMaxItems(items int) AppOption {
}
}
func WithTracingMaxBodyBytes(bytes int) AppOption {
return func(o *ApplicationConfig) {
o.TracingMaxBodyBytes = bytes
}
}
func WithGeneratedContentDir(generatedContentDir string) AppOption {
return func(o *ApplicationConfig) {
o.GeneratedContentDir = generatedContentDir
@@ -920,6 +928,7 @@ func (o *ApplicationConfig) ToRuntimeSettings() RuntimeSettings {
f16 := o.F16
debug := o.Debug
tracingMaxItems := o.TracingMaxItems
tracingMaxBodyBytes := o.TracingMaxBodyBytes
enableTracing := o.EnableTracing
enableBackendLogging := o.EnableBackendLogging
cors := o.CORS
@@ -1008,6 +1017,7 @@ func (o *ApplicationConfig) ToRuntimeSettings() RuntimeSettings {
F16: &f16,
Debug: &debug,
TracingMaxItems: &tracingMaxItems,
TracingMaxBodyBytes: &tracingMaxBodyBytes,
EnableTracing: &enableTracing,
EnableBackendLogging: &enableBackendLogging,
CORS: &cors,
@@ -1146,6 +1156,9 @@ func (o *ApplicationConfig) ApplyRuntimeSettings(settings *RuntimeSettings) (req
if settings.TracingMaxItems != nil {
o.TracingMaxItems = *settings.TracingMaxItems
}
if settings.TracingMaxBodyBytes != nil {
o.TracingMaxBodyBytes = *settings.TracingMaxBodyBytes
}
if settings.EnableBackendLogging != nil {
o.EnableBackendLogging = *settings.EnableBackendLogging
}

View File

@@ -40,7 +40,10 @@ type DistributedConfig struct {
// model-row cleanup on MarkUnhealthy / MarkDraining).
DisablePerModelHealthCheck bool
MCPCIJobTimeout time.Duration // MCP CI job execution timeout (default 10m)
MCPCIJobTimeout time.Duration // MCP CI job execution timeout (default 10m)
BackendInstallTimeout time.Duration // NATS round-trip timeout for backend.install (default 15m)
BackendUpgradeTimeout time.Duration // NATS round-trip timeout for backend.upgrade (default 15m)
MaxUploadSize int64 // Maximum upload body size in bytes (default 50 GB)
@@ -68,13 +71,15 @@ func (c DistributedConfig) Validate() error {
}
// Check for negative durations
for name, d := range map[string]time.Duration{
"mcp-tool-timeout": c.MCPToolTimeout,
"mcp-discovery-timeout": c.MCPDiscoveryTimeout,
"worker-wait-timeout": c.WorkerWaitTimeout,
"drain-timeout": c.DrainTimeout,
"health-check-interval": c.HealthCheckInterval,
"stale-node-threshold": c.StaleNodeThreshold,
"mcp-ci-job-timeout": c.MCPCIJobTimeout,
FlagMCPToolTimeout: c.MCPToolTimeout,
FlagMCPDiscoveryTimeout: c.MCPDiscoveryTimeout,
FlagWorkerWaitTimeout: c.WorkerWaitTimeout,
FlagDrainTimeout: c.DrainTimeout,
FlagHealthCheckInterval: c.HealthCheckInterval,
FlagStaleNodeThreshold: c.StaleNodeThreshold,
FlagMCPCIJobTimeout: c.MCPCIJobTimeout,
FlagBackendInstallTimeout: c.BackendInstallTimeout,
FlagBackendUpgradeTimeout: c.BackendUpgradeTimeout,
} {
if d < 0 {
return fmt.Errorf("%s must not be negative", name)
@@ -137,24 +142,66 @@ func WithStorageSecretKey(key string) AppOption {
}
}
func WithBackendInstallTimeout(d time.Duration) AppOption {
return func(o *ApplicationConfig) {
o.Distributed.BackendInstallTimeout = d
}
}
func WithBackendUpgradeTimeout(d time.Duration) AppOption {
return func(o *ApplicationConfig) {
o.Distributed.BackendUpgradeTimeout = d
}
}
var EnableAutoApproveNodes = func(o *ApplicationConfig) {
o.Distributed.AutoApproveNodes = true
}
// Flag names for distributed timeout / interval configuration. These are
// the kebab-case identifiers kong derives from the matching RunCMD struct
// fields; they appear in Validate error messages and any other operator-
// facing surface that needs to reference a specific knob by name. Keeping
// them as constants prevents the string from drifting from the actual
// flag a future rename would produce.
const (
FlagMCPToolTimeout = "mcp-tool-timeout"
FlagMCPDiscoveryTimeout = "mcp-discovery-timeout"
FlagWorkerWaitTimeout = "worker-wait-timeout"
FlagDrainTimeout = "drain-timeout"
FlagHealthCheckInterval = "health-check-interval"
FlagStaleNodeThreshold = "stale-node-threshold"
FlagMCPCIJobTimeout = "mcp-ci-job-timeout"
FlagBackendInstallTimeout = "backend-install-timeout"
FlagBackendUpgradeTimeout = "backend-upgrade-timeout"
)
// Defaults for distributed timeouts.
const (
DefaultMCPToolTimeout = 360 * time.Second
DefaultMCPDiscoveryTimeout = 60 * time.Second
DefaultWorkerWaitTimeout = 5 * time.Minute
DefaultDrainTimeout = 30 * time.Second
DefaultHealthCheckInterval = 15 * time.Second
DefaultStaleNodeThreshold = 60 * time.Second
DefaultMCPCIJobTimeout = 10 * time.Minute
DefaultMCPToolTimeout = 360 * time.Second
DefaultMCPDiscoveryTimeout = 60 * time.Second
DefaultWorkerWaitTimeout = 5 * time.Minute
DefaultDrainTimeout = 30 * time.Second
DefaultHealthCheckInterval = 15 * time.Second
DefaultStaleNodeThreshold = 60 * time.Second
DefaultMCPCIJobTimeout = 10 * time.Minute
DefaultBackendInstallTimeout = 15 * time.Minute
DefaultBackendUpgradeTimeout = 15 * time.Minute
)
// DefaultMaxUploadSize is the default maximum upload body size (50 GB).
const DefaultMaxUploadSize int64 = 50 << 30
// BackendInstallTimeoutOrDefault returns the configured timeout or the default.
func (c DistributedConfig) BackendInstallTimeoutOrDefault() time.Duration {
return cmp.Or(c.BackendInstallTimeout, DefaultBackendInstallTimeout)
}
// BackendUpgradeTimeoutOrDefault returns the configured timeout or the default.
func (c DistributedConfig) BackendUpgradeTimeoutOrDefault() time.Duration {
return cmp.Or(c.BackendUpgradeTimeout, DefaultBackendUpgradeTimeout)
}
// MCPToolTimeoutOrDefault returns the configured timeout or the default.
func (c DistributedConfig) MCPToolTimeoutOrDefault() time.Duration {
return cmp.Or(c.MCPToolTimeout, DefaultMCPToolTimeout)
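
The OrDefault accessors above rely on cmp.Or (Go 1.22 and later), which returns its first non-zero argument. The sketch below reproduces that pattern on a cut-down, hypothetical distributedConfig type to show why Validate still has to reject negative values: a negative duration is non-zero, so cmp.Or would pass it through unchanged. Type, method, and constant names here are illustrative and not the ones in the repository.

```go
package main

import (
	"cmp"
	"fmt"
	"time"
)

// Hypothetical, reduced version of the config in the diff.
const (
	flagBackendInstallTimeout    = "backend-install-timeout"
	defaultBackendInstallTimeout = 15 * time.Minute
)

type distributedConfig struct {
	BackendInstallTimeout time.Duration
}

// validate rejects negative durations; zero is valid and means "use default".
func (c distributedConfig) validate() error {
	if c.BackendInstallTimeout < 0 {
		return fmt.Errorf("%s must not be negative", flagBackendInstallTimeout)
	}
	return nil
}

// installTimeoutOrDefault returns the configured value, or the default
// when the field was left at its zero value.
func (c distributedConfig) installTimeoutOrDefault() time.Duration {
	return cmp.Or(c.BackendInstallTimeout, defaultBackendInstallTimeout)
}

func main() {
	fmt.Println(distributedConfig{}.installTimeoutOrDefault())
	fmt.Println(distributedConfig{BackendInstallTimeout: 42 * time.Minute}.installTimeoutOrDefault())
	fmt.Println(distributedConfig{BackendInstallTimeout: -time.Second}.validate())
}
```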

View File

@@ -0,0 +1,90 @@
package config_test
import (
"time"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"github.com/mudler/LocalAI/core/config"
)
var _ = Describe("DistributedConfig backend NATS timeouts", func() {
Context("BackendInstallTimeoutOrDefault", func() {
It("returns 15 minutes when unset", func() {
c := config.DistributedConfig{}
Expect(c.BackendInstallTimeoutOrDefault()).To(Equal(15 * time.Minute))
})
It("returns the configured value when set", func() {
c := config.DistributedConfig{BackendInstallTimeout: 42 * time.Minute}
Expect(c.BackendInstallTimeoutOrDefault()).To(Equal(42 * time.Minute))
})
})
Context("BackendUpgradeTimeoutOrDefault", func() {
It("returns 15 minutes when unset", func() {
c := config.DistributedConfig{}
Expect(c.BackendUpgradeTimeoutOrDefault()).To(Equal(15 * time.Minute))
})
It("returns the configured value when set", func() {
c := config.DistributedConfig{BackendUpgradeTimeout: 30 * time.Minute}
Expect(c.BackendUpgradeTimeoutOrDefault()).To(Equal(30 * time.Minute))
})
})
})
var _ = Describe("DistributedConfig flag-name constants", func() {
// Pin the kebab-case strings so a rename of the Go field name (or a
// CLI flag naming convention change) forces the constant to update,
// keeping the Validate error messages and any future operator-facing
// surface in sync with the actual CLI flag.
DescribeTable("flag name constants",
func(actual, expected string) {
Expect(actual).To(Equal(expected))
},
Entry("MCP tool timeout", config.FlagMCPToolTimeout, "mcp-tool-timeout"),
Entry("MCP discovery timeout", config.FlagMCPDiscoveryTimeout, "mcp-discovery-timeout"),
Entry("worker wait timeout", config.FlagWorkerWaitTimeout, "worker-wait-timeout"),
Entry("drain timeout", config.FlagDrainTimeout, "drain-timeout"),
Entry("health check interval", config.FlagHealthCheckInterval, "health-check-interval"),
Entry("stale node threshold", config.FlagStaleNodeThreshold, "stale-node-threshold"),
Entry("MCP CI job timeout", config.FlagMCPCIJobTimeout, "mcp-ci-job-timeout"),
Entry("backend install timeout", config.FlagBackendInstallTimeout, "backend-install-timeout"),
Entry("backend upgrade timeout", config.FlagBackendUpgradeTimeout, "backend-upgrade-timeout"),
)
})
var _ = Describe("DistributedConfig.Validate negative-duration errors", func() {
It("rejects a negative BackendInstallTimeout with the flag name in the error", func() {
c := config.DistributedConfig{
Enabled: true,
NatsURL: "nats://localhost:4222",
BackendInstallTimeout: -1 * time.Second,
}
err := c.Validate()
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring(config.FlagBackendInstallTimeout))
Expect(err.Error()).To(ContainSubstring("must not be negative"))
})
It("rejects a negative BackendUpgradeTimeout with the flag name in the error", func() {
c := config.DistributedConfig{
Enabled: true,
NatsURL: "nats://localhost:4222",
BackendUpgradeTimeout: -1 * time.Second,
}
err := c.Validate()
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring(config.FlagBackendUpgradeTimeout))
})
It("accepts all-zero durations as valid (defaults apply)", func() {
c := config.DistributedConfig{
Enabled: true,
NatsURL: "nats://localhost:4222",
}
Expect(c.Validate()).To(Succeed())
})
})

View File

@@ -136,4 +136,36 @@ var _ = Describe("Backend hooks and parser defaults", func() {
Expect(cfg.EngineArgs["enable_chunked_prefill"]).To(Equal(true))
})
})
Context("PromptCacheAll default", func() {
It("defaults to true when omitted from YAML", func() {
cfg := &ModelConfig{}
cfg.SetDefaults()
Expect(cfg.PromptCacheAll).NotTo(BeNil())
Expect(*cfg.PromptCacheAll).To(BeTrue())
})
It("preserves an explicit false from YAML", func() {
falseV := false
cfg := &ModelConfig{
LLMConfig: LLMConfig{PromptCacheAll: &falseV},
}
cfg.SetDefaults()
Expect(cfg.PromptCacheAll).NotTo(BeNil())
Expect(*cfg.PromptCacheAll).To(BeFalse())
})
It("preserves an explicit true from YAML", func() {
trueV := true
cfg := &ModelConfig{
LLMConfig: LLMConfig{PromptCacheAll: &trueV},
}
cfg.SetDefaults()
Expect(cfg.PromptCacheAll).NotTo(BeNil())
Expect(*cfg.PromptCacheAll).To(BeTrue())
})
})
})

View File

@@ -209,7 +209,7 @@ type LLMConfig struct {
RMSNormEps float32 `yaml:"rms_norm_eps,omitempty" json:"rms_norm_eps,omitempty"`
NGQA int32 `yaml:"ngqa,omitempty" json:"ngqa,omitempty"`
PromptCachePath string `yaml:"prompt_cache_path,omitempty" json:"prompt_cache_path,omitempty"`
PromptCacheAll bool `yaml:"prompt_cache_all,omitempty" json:"prompt_cache_all,omitempty"`
PromptCacheAll *bool `yaml:"prompt_cache_all,omitempty" json:"prompt_cache_all,omitempty"`
PromptCacheRO bool `yaml:"prompt_cache_ro,omitempty" json:"prompt_cache_ro,omitempty"`
MirostatETA *float64 `yaml:"mirostat_eta,omitempty" json:"mirostat_eta,omitempty"`
MirostatTAU *float64 `yaml:"mirostat_tau,omitempty" json:"mirostat_tau,omitempty"`
@@ -494,6 +494,13 @@ func (cfg *ModelConfig) SetDefaults(opts ...ConfigLoaderOption) {
cfg.Reranking = &falseV
}
if cfg.PromptCacheAll == nil {
// Match upstream llama.cpp's default (common/common.h: cache_prompt = true)
// and let cache_idle_slots / kv_unified actually do useful work; users can
// opt out with an explicit `prompt_cache_all: false` in the model YAML.
cfg.PromptCacheAll = &trueV
}
if threads == 0 {
// Threads can't be 0
threads = 4
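
Changing PromptCacheAll from bool to *bool is what lets SetDefaults tell "omitted" apart from "explicitly false". The sketch below illustrates the same tri-state pattern on a hypothetical, reduced llmConfig type. It decodes JSON with the standard library only to stay self-contained; the project's model files are YAML, and the assumption here is that a pointer field behaves the same way under the YAML decoder (nil when the key is absent).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// llmConfig is a hypothetical, reduced stand-in for the LLMConfig in the diff.
type llmConfig struct {
	PromptCacheAll *bool `json:"prompt_cache_all,omitempty"`
}

// setDefaults fills in true only when the key was absent (nil pointer).
func (c *llmConfig) setDefaults() {
	if c.PromptCacheAll == nil {
		t := true
		c.PromptCacheAll = &t
	}
}

// load decodes a config document and applies defaults afterwards.
func load(raw string) (llmConfig, error) {
	var c llmConfig
	if err := json.Unmarshal([]byte(raw), &c); err != nil {
		return c, err
	}
	c.setDefaults()
	return c, nil
}

func main() {
	for _, raw := range []string{`{}`, `{"prompt_cache_all": false}`, `{"prompt_cache_all": true}`} {
		c, err := load(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%-32s -> %v\n", raw, *c.PromptCacheAll)
	}
}
```

With a plain bool, the second input would be indistinguishable from the first, and a default of true could never be overridden. Every read site then has to dereference the pointer, which is why gRPCPredictOpts switches to *c.PromptCacheAll.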

View File

@@ -38,6 +38,7 @@ type RuntimeSettings struct {
Debug *bool `json:"debug,omitempty"`
EnableTracing *bool `json:"enable_tracing,omitempty"`
TracingMaxItems *int `json:"tracing_max_items,omitempty"`
TracingMaxBodyBytes *int `json:"tracing_max_body_bytes,omitempty"` // Per-body cap in bytes; 0 disables the cap
EnableBackendLogging *bool `json:"enable_backend_logging,omitempty"`
// Security/CORS settings

View File

@@ -28,6 +28,7 @@ import (
"github.com/mudler/LocalAI/core/services/monitoring"
"github.com/mudler/LocalAI/core/services/nodes"
"github.com/mudler/LocalAI/core/services/quantization"
"github.com/mudler/LocalAI/pkg/signals"
"github.com/mudler/xlog"
)
@@ -267,9 +268,12 @@ func API(application *application.Application) (*echo.Echo, error) {
e.Static("/generated-videos", videoPath)
}
// Initialize usage recording when auth DB is available
// Initialize usage recording when auth DB is available, and ensure the
// batcher drains its in-memory queue on graceful shutdown so the last
// few seconds of usage don't disappear when the process exits.
if application.AuthDB() != nil {
httpMiddleware.InitUsageRecorder(application.AuthDB())
signals.RegisterGracefulTerminationHandler(httpMiddleware.ShutdownUsageRecorder)
}
// Auth is applied to _all_ endpoints. Filtering out endpoints to bypass is
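
The hunk above registers httpMiddleware.ShutdownUsageRecorder as a graceful-termination handler so that queued usage records are flushed before the process exits. Neither the recorder nor the signals package is shown in this diff, so the following is only a rough sketch of the idea under stated assumptions: an in-memory batcher with a flush-on-shutdown method, and a tiny handler registry standing in for signals.RegisterGracefulTerminationHandler. All names (recorder, registerShutdown, runShutdown) are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// recorder is a hypothetical in-memory usage batcher.
type recorder struct {
	mu      sync.Mutex
	pending []string // queued but not yet persisted
	flushed []string // stands in for rows written to the database
}

// Record queues one usage event.
func (r *recorder) Record(event string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.pending = append(r.pending, event)
}

// Shutdown drains whatever is still queued so nothing is lost on exit.
func (r *recorder) Shutdown() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.flushed = append(r.flushed, r.pending...)
	r.pending = nil
}

// Hypothetical handler registry; the real signals package is not shown here.
var shutdownHandlers []func()

func registerShutdown(h func()) { shutdownHandlers = append(shutdownHandlers, h) }

// runShutdown invokes every registered handler in registration order.
func runShutdown() {
	for _, h := range shutdownHandlers {
		h()
	}
}

func main() {
	r := &recorder{}
	registerShutdown(r.Shutdown)
	r.Record("chat completion")
	r.Record("embedding")
	runShutdown() // in a real server this would run on SIGTERM/SIGINT
	fmt.Println("flushed:", len(r.flushed), "pending:", len(r.pending))
}
```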
@@ -403,7 +407,7 @@ func API(application *application.Application) (*echo.Echo, error) {
}
}
routes.RegisterNodeSelfServiceRoutes(e, registry, distCfg.RegistrationToken, distCfg.AutoApproveNodes, application.AuthDB(), application.ApplicationConfig().Auth.APIKeyHMACSecret)
routes.RegisterNodeAdminRoutes(e, registry, remoteUnloader, adminMiddleware, application.AuthDB(), application.ApplicationConfig().Auth.APIKeyHMACSecret, application.ApplicationConfig().Distributed.RegistrationToken)
routes.RegisterNodeAdminRoutes(e, registry, remoteUnloader, application.GalleryService(), opcache, application.ApplicationConfig(), adminMiddleware, application.AuthDB(), application.ApplicationConfig().Auth.APIKeyHMACSecret, application.ApplicationConfig().Distributed.RegistrationToken)
// Distributed SSE routes (job progress + agent events via NATS)
if d := application.Distributed(); d != nil {

View File

@@ -38,9 +38,15 @@ func InitDB(databaseURL string) (*gorm.DB, error) {
}
// Backfill: users created before the provider column existed have an empty
// provider treat them as local accounts so the UI can identify them.
// provider - treat them as local accounts so the UI can identify them.
db.Exec("UPDATE users SET provider = ? WHERE provider = '' OR provider IS NULL", ProviderLocal)
// Backfill: pre-feature usage_records have no source column. Classify them so the
// new per-source aggregators include them.
if err := BackfillUsageSource(db); err != nil {
return nil, fmt.Errorf("failed to backfill usage source: %w", err)
}
// Create composite index on users(provider, subject) for fast OAuth lookups
if err := db.Exec("CREATE INDEX IF NOT EXISTS idx_users_provider_subject ON users(provider, subject)").Error; err != nil {
// Ignore error on postgres if index already exists

View File

@@ -16,8 +16,10 @@ import (
)
const (
contextKeyUser = "auth_user"
contextKeyRole = "auth_role"
contextKeyUser = "auth_user"
contextKeyRole = "auth_role"
contextKeyAPIKey = "auth_apikey"
contextKeySource = "auth_source"
)
// Middleware returns an Echo middleware that handles authentication.
@@ -75,6 +77,7 @@ func Middleware(db *gorm.DB, appConfig *config.ApplicationConfig) echo.Middlewar
}
c.Set(contextKeyUser, syntheticUser)
c.Set(contextKeyRole, RoleAdmin)
c.Set(contextKeySource, UsageSourceLegacy)
authenticated = true
}
}
@@ -213,6 +216,20 @@ func GetUserRole(c echo.Context) string {
return role
}
// GetAPIKey returns the resolved API key from the echo context, or nil.
// Nil for session-cookie and legacy-env-key authentication.
func GetAPIKey(c echo.Context) *UserAPIKey {
k, _ := c.Get(contextKeyAPIKey).(*UserAPIKey)
return k
}
// GetSource returns the request's authentication source: UsageSourceAPIKey,
// UsageSourceWeb, UsageSourceLegacy, or empty if no authentication was performed.
func GetSource(c echo.Context) string {
s, _ := c.Get(contextKeySource).(string)
return s
}
// RequireRouteFeature returns a global middleware that checks the user has access
// to the feature required by the matched route. It uses the RouteFeatureRegistry
// to look up the required feature for each route pattern + HTTP method.
@@ -421,47 +438,67 @@ func RequireQuota(db *gorm.DB) echo.MiddlewareFunc {
}
// tryAuthenticate attempts to authenticate the request using the database.
//
// On success it returns the user and, as a side effect, sets the following
// values on the Echo context:
// - contextKeySource ("auth_source"): always set, one of UsageSourceWeb /
// UsageSourceAPIKey. UsageSourceLegacy is set elsewhere by the parent
// Middleware when a legacy env key matches.
// - contextKeyAPIKey ("auth_apikey"): set to the resolved *UserAPIKey for
// named-key branches (Bearer, x-api-key, xi-api-key, token cookie).
// - "_auth_session": session record, used by Middleware to drive cookie
// rotation. Only set on the session-cookie branch.
//
// contextKeyUser and contextKeyRole are populated by the parent Middleware
// after this function returns.
func tryAuthenticate(c echo.Context, db *gorm.DB, appConfig *config.ApplicationConfig) *User {
hmacSecret := appConfig.Auth.APIKeyHMACSecret
// a. Session cookie
// a. Session cookie -> web UI
if cookie, err := c.Cookie(sessionCookie); err == nil && cookie.Value != "" {
if user, session := ValidateSession(db, cookie.Value, hmacSecret); user != nil {
// Store session for rotation check in middleware
c.Set("_auth_session", session)
c.Set(contextKeySource, UsageSourceWeb)
return user
}
}
// b. Authorization: Bearer token
// b. Authorization: Bearer
authHeader := c.Request().Header.Get("Authorization")
if strings.HasPrefix(authHeader, "Bearer ") {
token := strings.TrimPrefix(authHeader, "Bearer ")
// Try as session ID first
// b1. Session token via Bearer -> still web UI
if user, _ := ValidateSession(db, token, hmacSecret); user != nil {
c.Set(contextKeySource, UsageSourceWeb)
return user
}
// Try as user API key
// b2. Named API key
if key, err := ValidateAPIKey(db, token, hmacSecret); err == nil {
c.Set(contextKeySource, UsageSourceAPIKey)
c.Set(contextKeyAPIKey, key)
return &key.User
}
}
// c. x-api-key / xi-api-key headers
// c. x-api-key / xi-api-key -> named API key
for _, header := range []string{"x-api-key", "xi-api-key"} {
if key := c.Request().Header.Get(header); key != "" {
if apiKey, err := ValidateAPIKey(db, key, hmacSecret); err == nil {
if k := c.Request().Header.Get(header); k != "" {
if apiKey, err := ValidateAPIKey(db, k, hmacSecret); err == nil {
c.Set(contextKeySource, UsageSourceAPIKey)
c.Set(contextKeyAPIKey, apiKey)
return &apiKey.User
}
}
}
// d. token cookie (legacy)
// d. token cookie -> named API key
if cookie, err := c.Cookie("token"); err == nil && cookie.Value != "" {
// Try as user API key
if key, err := ValidateAPIKey(db, cookie.Value, hmacSecret); err == nil {
c.Set(contextKeySource, UsageSourceAPIKey)
c.Set(contextKeyAPIKey, key)
return &key.User
}
}
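
The credential order in tryAuthenticate decides which usage source a request is tagged with when it carries more than one credential. The sketch below models only that precedence. The `credentials` and `store` types and the `classify` function are hypothetical stand-ins for illustration; they are not part of the diff, and the real code resolves sessions and keys through ValidateSession and ValidateAPIKey against the database.

```go
package main

import "fmt"

// Source labels mirroring the UsageSourceWeb / UsageSourceAPIKey values in the diff.
const (
	sourceAPIKey = "apikey"
	sourceWeb    = "web"
)

// credentials is a hypothetical flattened view of what one request carries.
type credentials struct {
	sessionCookie string
	bearer        string
	xAPIKey       string
	tokenCookie   string
}

// store is a hypothetical in-memory stand-in for the database lookups.
type store struct {
	sessions map[string]bool
	apiKeys  map[string]bool
}

// classify checks credentials in the same order as the branches a, b1, b2, c, d
// and returns the source of the first one that validates, or "" if none do.
func classify(s store, c credentials) string {
	// a. Session cookie -> web UI.
	if c.sessionCookie != "" && s.sessions[c.sessionCookie] {
		return sourceWeb
	}
	if c.bearer != "" {
		// b1. A session token sent as Bearer still counts as web UI.
		if s.sessions[c.bearer] {
			return sourceWeb
		}
		// b2. Named API key.
		if s.apiKeys[c.bearer] {
			return sourceAPIKey
		}
	}
	// c. x-api-key / xi-api-key header (collapsed to one field here).
	if c.xAPIKey != "" && s.apiKeys[c.xAPIKey] {
		return sourceAPIKey
	}
	// d. token cookie.
	if c.tokenCookie != "" && s.apiKeys[c.tokenCookie] {
		return sourceAPIKey
	}
	return ""
}

func main() {
	s := store{
		sessions: map[string]bool{"sess-1": true},
		apiKeys:  map[string]bool{"key-1": true},
	}
	// A valid session cookie wins even when an API key header is also present.
	fmt.Println(classify(s, credentials{sessionCookie: "sess-1", xAPIKey: "key-1"}))
	fmt.Println(classify(s, credentials{bearer: "key-1"}))
}
```

In this model an invalid session cookie does not short-circuit the chain: the later branches still get a chance, which matches the fall-through structure of the function above.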

@@ -303,4 +303,122 @@ var _ = Describe("Auth Middleware", func() {
}
})
})
Describe("auth context plumbing for usage source", func() {
// probeApp builds a minimal echo app with the auth middleware and a single
// "/probe" route that captures the user, source, and apikey from context.
type probe struct {
user *auth.User
source string
key *auth.UserAPIKey
}
probeApp := func(db *gorm.DB, appConfig *config.ApplicationConfig, p *probe) *echo.Echo {
e := echo.New()
e.Use(auth.Middleware(db, appConfig))
e.GET("/probe", func(c echo.Context) error {
p.user = auth.GetUser(c)
p.source = auth.GetSource(c)
p.key = auth.GetAPIKey(c)
return c.NoContent(http.StatusOK)
})
return e
}
It("session cookie sets source=web, apikey=nil", func() {
db := testDB()
appConfig := config.NewApplicationConfig()
user := createTestUser(db, "alice@example.com", auth.RoleUser, auth.ProviderLocal)
token := createTestSession(db, user.ID)
var p probe
app := probeApp(db, appConfig, &p)
rec := doRequest(app, http.MethodGet, "/probe", withSessionCookie(token))
Expect(rec.Code).To(Equal(http.StatusOK))
Expect(p.user).ToNot(BeNil())
Expect(p.user.ID).To(Equal(user.ID))
Expect(p.source).To(Equal(auth.UsageSourceWeb))
Expect(p.key).To(BeNil())
})
It("Bearer session token sets source=web, apikey=nil", func() {
db := testDB()
appConfig := config.NewApplicationConfig()
user := createTestUser(db, "alice@example.com", auth.RoleUser, auth.ProviderLocal)
token := createTestSession(db, user.ID)
var p probe
app := probeApp(db, appConfig, &p)
rec := doRequest(app, http.MethodGet, "/probe", withBearerToken(token))
Expect(rec.Code).To(Equal(http.StatusOK))
Expect(p.user).ToNot(BeNil())
Expect(p.user.ID).To(Equal(user.ID))
Expect(p.source).To(Equal(auth.UsageSourceWeb))
Expect(p.key).To(BeNil())
})
It("Bearer API key sets source=apikey and exposes the resolved *UserAPIKey", func() {
db := testDB()
appConfig := config.NewApplicationConfig()
user := createTestUser(db, "alice@example.com", auth.RoleUser, auth.ProviderLocal)
plaintext, key, err := auth.CreateAPIKey(db, user.ID, "ci", auth.RoleUser, appConfig.Auth.APIKeyHMACSecret, nil)
Expect(err).ToNot(HaveOccurred())
var p probe
app := probeApp(db, appConfig, &p)
rec := doRequest(app, http.MethodGet, "/probe", withBearerToken(plaintext))
Expect(rec.Code).To(Equal(http.StatusOK))
Expect(p.source).To(Equal(auth.UsageSourceAPIKey))
Expect(p.key).ToNot(BeNil())
Expect(p.key.ID).To(Equal(key.ID))
})
It("x-api-key header sets source=apikey", func() {
db := testDB()
appConfig := config.NewApplicationConfig()
user := createTestUser(db, "alice@example.com", auth.RoleUser, auth.ProviderLocal)
plaintext, _, err := auth.CreateAPIKey(db, user.ID, "ci", auth.RoleUser, appConfig.Auth.APIKeyHMACSecret, nil)
Expect(err).ToNot(HaveOccurred())
var p probe
app := probeApp(db, appConfig, &p)
rec := doRequest(app, http.MethodGet, "/probe", withXApiKey(plaintext))
Expect(rec.Code).To(Equal(http.StatusOK))
Expect(p.source).To(Equal(auth.UsageSourceAPIKey))
Expect(p.key).ToNot(BeNil())
})
It("token cookie sets source=apikey", func() {
db := testDB()
appConfig := config.NewApplicationConfig()
user := createTestUser(db, "alice@example.com", auth.RoleUser, auth.ProviderLocal)
plaintext, _, err := auth.CreateAPIKey(db, user.ID, "ci", auth.RoleUser, appConfig.Auth.APIKeyHMACSecret, nil)
Expect(err).ToNot(HaveOccurred())
var p probe
app := probeApp(db, appConfig, &p)
rec := doRequest(app, http.MethodGet, "/probe", withTokenCookie(plaintext))
Expect(rec.Code).To(Equal(http.StatusOK))
Expect(p.source).To(Equal(auth.UsageSourceAPIKey))
Expect(p.key).ToNot(BeNil())
})
It("legacy env key sets source=legacy, apikey=nil", func() {
db := testDB()
appConfig := config.NewApplicationConfig()
appConfig.ApiKeys = []string{"legacy-secret"}
var p probe
app := probeApp(db, appConfig, &p)
rec := doRequest(app, http.MethodGet, "/probe", withBearerToken("legacy-secret"))
Expect(rec.Code).To(Equal(http.StatusOK))
Expect(p.source).To(Equal(auth.UsageSourceLegacy))
Expect(p.key).To(BeNil())
})
})
})

@@ -5,14 +5,31 @@ import (
"strings"
"time"
"github.com/mudler/xlog"
"gorm.io/gorm"
)
// Source classification for a UsageRecord.
const (
UsageSourceAPIKey = "apikey" // request authenticated with a named UserAPIKey
UsageSourceWeb = "web" // request authenticated with a session cookie (web UI)
UsageSourceLegacy = "legacy" // request authenticated with an env-configured legacy key
)
// UsageRecord represents a single API request's token usage.
type UsageRecord struct {
ID uint `gorm:"primaryKey;autoIncrement"`
UserID string `gorm:"size:36;index:idx_usage_user_time"`
UserName string `gorm:"size:255"`
ID uint `gorm:"primaryKey;autoIncrement"`
UserID string `gorm:"size:36;index:idx_usage_user_time"`
UserName string `gorm:"size:255"`
// Source classifies how the request authenticated. One of UsageSource* constants.
// Empty for pre-feature rows until the InitDB backfill runs.
Source string `gorm:"size:16;index:idx_usage_source"`
// APIKeyID is the UserAPIKey.ID when Source == UsageSourceAPIKey. Nil otherwise.
APIKeyID *string `gorm:"size:36;index:idx_usage_apikey"`
// APIKeyName is a snapshot of UserAPIKey.Name at write time. Survives key deletion.
APIKeyName string `gorm:"size:255"`
Model string `gorm:"size:255;index"`
Endpoint string `gorm:"size:255"`
PromptTokens int64
@@ -30,9 +47,12 @@ func RecordUsage(db *gorm.DB, record *UsageRecord) error {
// UsageBucket is an aggregated time bucket for the dashboard.
type UsageBucket struct {
Bucket string `json:"bucket"`
Model string `json:"model"`
Model string `json:"model,omitempty"`
UserID string `json:"user_id,omitempty"`
UserName string `json:"user_name,omitempty"`
Source string `json:"source,omitempty"`
APIKeyID string `json:"api_key_id,omitempty"`
APIKeyName string `json:"api_key_name,omitempty"`
PromptTokens int64 `json:"prompt_tokens"`
CompletionTokens int64 `json:"completion_tokens"`
TotalTokens int64 `json:"total_tokens"`
@@ -119,6 +139,28 @@ func GetUserUsage(db *gorm.DB, userID, period string) ([]UsageBucket, error) {
return buckets, nil
}
// BackfillUsageSource sets the Source column on pre-feature usage rows.
// Idempotent: only touches rows where source is NULL or empty.
// - rows whose user_id == "legacy-api-key" -> UsageSourceLegacy
// - everything else -> UsageSourceWeb
func BackfillUsageSource(db *gorm.DB) error {
// Legacy first (more specific predicate)
if err := db.Exec(
`UPDATE usage_records SET source = ? WHERE (source IS NULL OR source = '') AND user_id = ?`,
UsageSourceLegacy, "legacy-api-key",
).Error; err != nil {
return fmt.Errorf("backfill legacy usage source: %w", err)
}
// Everything else -> web
if err := db.Exec(
`UPDATE usage_records SET source = ? WHERE (source IS NULL OR source = '')`,
UsageSourceWeb,
).Error; err != nil {
return fmt.Errorf("backfill web usage source: %w", err)
}
return nil
}
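
The ordering of the two UPDATE statements matters: the second one matches every still-unclassified row, so the legacy rows have to be claimed first. A minimal in-memory sketch of the same two-pass logic follows; the `row` type and `backfill` function are hypothetical and only imitate the SQL predicates, they are not code from this change.

```go
package main

import "fmt"

// row is a hypothetical in-memory stand-in for a usage_records row.
type row struct {
	userID string
	source string
}

// backfill imitates the two UPDATE statements: the specific predicate runs
// first, then the catch-all. Rows that already have a source are never touched,
// which is what makes a second run a no-op.
func backfill(rows []row) {
	// Pass 1: unclassified rows written by the synthetic legacy user.
	for i := range rows {
		if rows[i].source == "" && rows[i].userID == "legacy-api-key" {
			rows[i].source = "legacy"
		}
	}
	// Pass 2: every remaining unclassified row is treated as web traffic.
	for i := range rows {
		if rows[i].source == "" {
			rows[i].source = "web"
		}
	}
}

func main() {
	rows := []row{
		{userID: "legacy-api-key"},
		{userID: "user-x"},
		{userID: "user-y", source: "apikey"},
	}
	backfill(rows)
	backfill(rows) // second run changes nothing
	for _, r := range rows {
		fmt.Println(r.userID, r.source)
	}
}
```

If the passes were swapped, the catch-all would label the legacy rows as web and the specific predicate would then find nothing left to update.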
// GetAllUsage returns aggregated usage for all users (admin). Optional userID filter.
func GetAllUsage(db *gorm.DB, period, userID string) ([]UsageBucket, error) {
sqlite := isSQLiteDB(db)
@@ -149,3 +191,257 @@ func GetAllUsage(db *gorm.DB, period, userID string) ([]UsageBucket, error) {
}
return buckets, nil
}
// TotalsEntry is a token+request roll-up.
type TotalsEntry struct {
Tokens int64 `json:"tokens"`
Requests int64 `json:"requests"`
}
// KeyTotal is the per-key roll-up returned by sources endpoints. UserID and
// UserName are snapshotted from the UsageRecord so revoked-and-deleted keys
// still carry their owner attribution in admin views.
type KeyTotal struct {
APIKeyID string `json:"api_key_id"`
APIKeyName string `json:"api_key_name"`
UserID string `json:"user_id"`
UserName string `json:"user_name"`
Tokens int64 `json:"tokens"`
Requests int64 `json:"requests"`
LastUsed time.Time `json:"last_used"`
}
// UserSourceTotal is a per-(user, source) roll-up for sources that don't carry
// a named API key identity (web, legacy). It exists so admin views can show
// which user generated each block of Web UI / legacy traffic; the per-apikey
// breakdown for source=apikey already lives in KeyTotal.
type UserSourceTotal struct {
Source string `json:"source"`
UserID string `json:"user_id"`
UserName string `json:"user_name"`
Tokens int64 `json:"tokens"`
Requests int64 `json:"requests"`
}
// SourceTotals summarises a per-source breakdown.
type SourceTotals struct {
BySource map[string]TotalsEntry `json:"by_source"`
ByKey []KeyTotal `json:"by_key"` // server-sorted desc by tokens, capped
ByUserSource []UserSourceTotal `json:"by_user_source,omitempty"` // populated only when includeLegacy=true
GrandTotal TotalsEntry `json:"grand_total"`
}
const maxKeyTotals = 200
// GetUserUsageBySource returns per-source aggregated usage for one user. Legacy
// is excluded by design (visible to admins only via the admin variant).
func GetUserUsageBySource(db *gorm.DB, userID, period string) ([]UsageBucket, SourceTotals, error) {
sqlite := isSQLiteDB(db)
since, dateFmt := periodToWindow(period, sqlite)
bucketExpr := fmt.Sprintf("%s as bucket", dateFmt)
query := db.Model(&UsageRecord{}).
Select(bucketExpr+", source, COALESCE(api_key_id, '') as api_key_id, api_key_name, "+
"SUM(prompt_tokens) as prompt_tokens, "+
"SUM(completion_tokens) as completion_tokens, "+
"SUM(total_tokens) as total_tokens, "+
"COUNT(*) as request_count").
Where("user_id = ?", userID).
Where("source <> ?", UsageSourceLegacy).
Group("bucket, source, api_key_id, api_key_name").
Order("bucket ASC")
if !since.IsZero() {
query = query.Where("created_at >= ?", since)
}
var buckets []UsageBucket
if err := query.Find(&buckets).Error; err != nil {
return nil, SourceTotals{}, err
}
totals := computeSourceTotals(db, userID, "", since, false)
return buckets, totals, nil
}
// computeSourceTotals rolls up by_source / by_key / grand_total.
// userID/apiKeyID are optional filters. includeLegacy controls whether the
// legacy bucket is exposed (admin-only).
func computeSourceTotals(db *gorm.DB, userID, apiKeyID string, since time.Time, includeLegacy bool) SourceTotals {
totals := SourceTotals{BySource: map[string]TotalsEntry{}}
bySourceQ := db.Model(&UsageRecord{}).
Select("source, SUM(total_tokens) as tokens, COUNT(*) as requests").
Group("source")
bySourceQ = applyFilters(bySourceQ, userID, apiKeyID, since, includeLegacy)
var bySourceRows []struct {
Source string
Tokens int64
Requests int64
}
if err := bySourceQ.Scan(&bySourceRows).Error; err != nil {
xlog.Warn("computeSourceTotals: by-source Scan failed", "error", err)
return totals
}
for _, r := range bySourceRows {
totals.BySource[r.Source] = TotalsEntry{Tokens: r.Tokens, Requests: r.Requests}
totals.GrandTotal.Tokens += r.Tokens
totals.GrandTotal.Requests += r.Requests
}
byKeyQ := db.Model(&UsageRecord{}).
Select("COALESCE(api_key_id, '') as api_key_id, api_key_name, "+
"user_id, user_name, "+
"SUM(total_tokens) as tokens, COUNT(*) as requests, MAX(created_at) as last_used").
Where("api_key_id IS NOT NULL AND api_key_id <> ''").
Group("api_key_id, api_key_name, user_id, user_name").
Order("tokens DESC").
Limit(maxKeyTotals)
byKeyQ = applyFilters(byKeyQ, userID, apiKeyID, since, includeLegacy)
// Iterate Rows() manually because MAX(created_at) is returned as a string by
// the SQLite driver, and Go's database/sql refuses to scan that into
// *time.Time. Postgres returns a proper timestamp. We accept both shapes
// via a Rows.Scan into a string column, then parse uniformly.
rows, err := byKeyQ.Rows()
if err != nil {
xlog.Warn("computeSourceTotals: by-key Rows() failed", "error", err)
} else {
defer func() { _ = rows.Close() }()
out := make([]KeyTotal, 0)
for rows.Next() {
var (
apiKeyID, apiKeyName, userIDCol, userName, lastUsedRaw string
tokens, requests int64
)
if scanErr := rows.Scan(&apiKeyID, &apiKeyName, &userIDCol, &userName, &tokens, &requests, &lastUsedRaw); scanErr != nil {
continue
}
out = append(out, KeyTotal{
APIKeyID: apiKeyID,
APIKeyName: apiKeyName,
UserID: userIDCol,
UserName: userName,
Tokens: tokens,
Requests: requests,
LastUsed: parseLastUsedString(lastUsedRaw),
})
}
if rerr := rows.Err(); rerr != nil {
xlog.Warn("computeSourceTotals: by-key rows iteration failed", "error", rerr)
}
totals.ByKey = out
}
// by_user_source: only populated for admin callers (includeLegacy=true) so
// they can attribute Web UI / legacy traffic to specific users. Per-apikey
// rows already carry user info via KeyTotal above, so this query only
// covers source != apikey.
if includeLegacy {
byUserSourceQ := db.Model(&UsageRecord{}).
Select("source, user_id, user_name, "+
"SUM(total_tokens) as tokens, COUNT(*) as requests").
Where("source <> ?", UsageSourceAPIKey).
Group("source, user_id, user_name").
Order("tokens DESC")
byUserSourceQ = applyFilters(byUserSourceQ, userID, apiKeyID, since, includeLegacy)
var byUserSourceRows []UserSourceTotal
if scanErr := byUserSourceQ.Scan(&byUserSourceRows).Error; scanErr != nil {
xlog.Warn("computeSourceTotals: by-user-source Scan failed", "error", scanErr)
} else {
totals.ByUserSource = byUserSourceRows
}
}
return totals
}
// parseLastUsedString converts the textual MAX(created_at) value returned by
// SQLite (or any driver that surfaces the timestamp as a string) into a
// time.Time. Returns the zero time on parse failure.
func parseLastUsedString(s string) time.Time {
if s == "" {
return time.Time{}
}
// GORM's SQLite driver emits Go's default time formatting. Try the formats
// it commonly produces, falling back to RFC3339Nano.
layouts := []string{
"2006-01-02 15:04:05.999999999 -0700 MST",
"2006-01-02 15:04:05.999999999-07:00",
"2006-01-02 15:04:05.999999999",
"2006-01-02 15:04:05",
time.RFC3339Nano,
time.RFC3339,
}
for _, layout := range layouts {
if t, err := time.Parse(layout, s); err == nil {
return t
}
}
xlog.Warn("parseLastUsedString: unrecognised format", "value", s)
return time.Time{}
}
// GetAllUsageBySource is the admin variant of GetUserUsageBySource.
// Optional filters: userID and apiKeyID. Legacy is included.
// truncated == true iff the per-key roll-up was capped at maxKeyTotals.
func GetAllUsageBySource(db *gorm.DB, period, userID, apiKeyID string) ([]UsageBucket, SourceTotals, bool, error) {
sqlite := isSQLiteDB(db)
since, dateFmt := periodToWindow(period, sqlite)
bucketExpr := fmt.Sprintf("%s as bucket", dateFmt)
query := db.Model(&UsageRecord{}).
Select(bucketExpr+", source, COALESCE(api_key_id, '') as api_key_id, api_key_name, "+
"user_id, user_name, "+
"SUM(prompt_tokens) as prompt_tokens, "+
"SUM(completion_tokens) as completion_tokens, "+
"SUM(total_tokens) as total_tokens, "+
"COUNT(*) as request_count").
Group("bucket, source, api_key_id, api_key_name, user_id, user_name").
Order("bucket ASC")
query = applyFilters(query, userID, apiKeyID, since, true)
var buckets []UsageBucket
if err := query.Find(&buckets).Error; err != nil {
return nil, SourceTotals{}, false, err
}
totals := computeSourceTotals(db, userID, apiKeyID, since, true)
// Count distinct api_key_ids matching the filters. If > maxKeyTotals,
// the by_key slice was capped and we signal truncation to the caller.
truncated := false
var distinct int64
countQ := applyFilters(
db.Model(&UsageRecord{}).
Distinct("api_key_id").
Where("api_key_id IS NOT NULL AND api_key_id <> ''"),
userID, apiKeyID, since, true,
)
if err := countQ.Count(&distinct).Error; err != nil {
xlog.Warn("GetAllUsageBySource: distinct api_key_id count failed", "error", err)
} else {
truncated = distinct > maxKeyTotals
}
return buckets, totals, truncated, nil
}
func applyFilters(q *gorm.DB, userID, apiKeyID string, since time.Time, includeLegacy bool) *gorm.DB {
if userID != "" {
q = q.Where("user_id = ?", userID)
}
if apiKeyID != "" {
q = q.Where("api_key_id = ?", apiKeyID)
}
if !since.IsZero() {
q = q.Where("created_at >= ?", since)
}
if !includeLegacy {
q = q.Where("source <> ?", UsageSourceLegacy)
}
return q
}

@@ -3,11 +3,13 @@
package auth_test
import (
"fmt"
"time"
"github.com/mudler/LocalAI/core/http/auth"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"gorm.io/gorm"
)
var _ = Describe("Usage", func() {
@@ -158,4 +160,275 @@ var _ = Describe("Usage", func() {
}
})
})
Describe("Usage source backfill", func() {
It("backfills 'web' for pre-feature rows", func() {
db := testDB()
rawDB, err := db.DB()
Expect(err).ToNot(HaveOccurred())
_, err = rawDB.Exec(
`INSERT INTO usage_records (user_id, source, model, created_at, total_tokens, prompt_tokens, completion_tokens, duration) VALUES (?, '', ?, ?, 0, 0, 0, 0)`,
"user-x", "gpt-4", time.Now())
Expect(err).ToNot(HaveOccurred())
Expect(auth.BackfillUsageSource(db)).To(Succeed())
var loaded auth.UsageRecord
Expect(db.Where("user_id = ?", "user-x").First(&loaded).Error).To(Succeed())
Expect(loaded.Source).To(Equal(auth.UsageSourceWeb))
})
It("backfills 'legacy' for pre-feature rows with legacy-api-key user_id", func() {
db := testDB()
rawDB, err := db.DB()
Expect(err).ToNot(HaveOccurred())
_, err = rawDB.Exec(
`INSERT INTO usage_records (user_id, source, model, created_at, total_tokens, prompt_tokens, completion_tokens, duration) VALUES (?, '', ?, ?, 0, 0, 0, 0)`,
"legacy-api-key", "gpt-4", time.Now())
Expect(err).ToNot(HaveOccurred())
Expect(auth.BackfillUsageSource(db)).To(Succeed())
var loaded auth.UsageRecord
Expect(db.Where("user_id = ?", "legacy-api-key").First(&loaded).Error).To(Succeed())
Expect(loaded.Source).To(Equal(auth.UsageSourceLegacy))
})
It("is idempotent on re-run", func() {
db := testDB()
Expect(auth.BackfillUsageSource(db)).To(Succeed())
Expect(auth.BackfillUsageSource(db)).To(Succeed())
})
})
Describe("UsageRecord with source fields", func() {
It("persists Source, APIKeyID, APIKeyName", func() {
db := testDB()
keyID := "key-uuid-1"
record := &auth.UsageRecord{
UserID: "user-1",
UserName: "Test User",
Source: auth.UsageSourceAPIKey,
APIKeyID: &keyID,
APIKeyName: "ci-runner",
Model: "gpt-4",
Endpoint: "/v1/chat/completions",
TotalTokens: 150,
CreatedAt: time.Now(),
}
Expect(auth.RecordUsage(db, record)).To(Succeed())
var loaded auth.UsageRecord
Expect(db.First(&loaded, record.ID).Error).To(Succeed())
Expect(loaded.Source).To(Equal(auth.UsageSourceAPIKey))
Expect(loaded.APIKeyID).ToNot(BeNil())
Expect(*loaded.APIKeyID).To(Equal("key-uuid-1"))
Expect(loaded.APIKeyName).To(Equal("ci-runner"))
})
It("allows nil APIKeyID for web/legacy sources", func() {
db := testDB()
record := &auth.UsageRecord{
UserID: "user-1",
Source: auth.UsageSourceWeb,
Model: "gpt-4",
CreatedAt: time.Now(),
}
Expect(auth.RecordUsage(db, record)).To(Succeed())
var loaded auth.UsageRecord
Expect(db.First(&loaded, record.ID).Error).To(Succeed())
Expect(loaded.Source).To(Equal(auth.UsageSourceWeb))
Expect(loaded.APIKeyID).To(BeNil())
Expect(loaded.APIKeyName).To(BeEmpty())
})
})
Describe("GetUserUsageBySource", func() {
insert := func(db *gorm.DB, userID, source, keyID, keyName string, tokens int64, when time.Time) {
rec := &auth.UsageRecord{
UserID: userID,
Source: source,
Model: "gpt-4",
TotalTokens: tokens,
CreatedAt: when,
}
if keyID != "" {
rec.APIKeyID = &keyID
rec.APIKeyName = keyName
}
Expect(auth.RecordUsage(db, rec)).To(Succeed())
}
It("returns only the caller's rows, never legacy", func() {
db := testDB()
now := time.Now()
insert(db, "alice", auth.UsageSourceAPIKey, "k1", "ci", 100, now)
insert(db, "alice", auth.UsageSourceWeb, "", "", 50, now)
insert(db, "alice", auth.UsageSourceLegacy, "", "", 30, now)
insert(db, "bob", auth.UsageSourceAPIKey, "k2", "bobk", 90, now)
buckets, totals, err := auth.GetUserUsageBySource(db, "alice", "month")
Expect(err).ToNot(HaveOccurred())
for _, b := range buckets {
Expect(b.UserID).To(Or(BeEmpty(), Equal("alice")))
Expect(b.Source).ToNot(Equal(auth.UsageSourceLegacy))
}
Expect(totals.GrandTotal.Tokens).To(Equal(int64(150)))
Expect(totals.BySource[auth.UsageSourceAPIKey].Tokens).To(Equal(int64(100)))
Expect(totals.BySource[auth.UsageSourceWeb].Tokens).To(Equal(int64(50)))
_, hasLegacy := totals.BySource[auth.UsageSourceLegacy]
Expect(hasLegacy).To(BeFalse())
})
It("snapshots survive key deletion", func() {
db := testDB()
now := time.Now()
insert(db, "alice", auth.UsageSourceAPIKey, "deleted-key", "old-name", 42, now)
_, totals, err := auth.GetUserUsageBySource(db, "alice", "month")
Expect(err).ToNot(HaveOccurred())
Expect(totals.ByKey).To(HaveLen(1))
Expect(totals.ByKey[0].APIKeyName).To(Equal("old-name"))
Expect(totals.ByKey[0].APIKeyID).To(Equal("deleted-key"))
Expect(totals.ByKey[0].LastUsed).ToNot(BeZero())
Expect(totals.ByKey[0].LastUsed).To(BeTemporally("~", now, 2*time.Second))
})
})
Describe("GetAllUsageBySource", func() {
insert := func(db *gorm.DB, userID, source, keyID string, tokens int64) {
rec := &auth.UsageRecord{
UserID: userID,
Source: source,
Model: "gpt-4",
TotalTokens: tokens,
CreatedAt: time.Now(),
}
if keyID != "" {
rec.APIKeyID = &keyID
rec.APIKeyName = "name-" + keyID
}
Expect(auth.RecordUsage(db, rec)).To(Succeed())
}
It("includes legacy for admins", func() {
db := testDB()
insert(db, "alice", auth.UsageSourceAPIKey, "k1", 10)
insert(db, "legacy-api-key", auth.UsageSourceLegacy, "", 5)
_, totals, _, err := auth.GetAllUsageBySource(db, "month", "", "")
Expect(err).ToNot(HaveOccurred())
Expect(totals.BySource).To(HaveKey(auth.UsageSourceLegacy))
Expect(totals.BySource[auth.UsageSourceLegacy].Tokens).To(Equal(int64(5)))
})
It("filters by user_id AND api_key_id", func() {
db := testDB()
insert(db, "alice", auth.UsageSourceAPIKey, "k1", 10)
insert(db, "alice", auth.UsageSourceAPIKey, "k2", 20)
insert(db, "bob", auth.UsageSourceAPIKey, "k3", 30)
_, totals, _, err := auth.GetAllUsageBySource(db, "month", "alice", "k2")
Expect(err).ToNot(HaveOccurred())
Expect(totals.GrandTotal.Tokens).To(Equal(int64(20)))
})
It("sets truncated=true when by_key exceeds the cap", func() {
db := testDB()
for i := 0; i < 210; i++ {
insert(db, "alice", auth.UsageSourceAPIKey, fmt.Sprintf("key-%03d", i), int64(210-i))
}
_, totals, truncated, err := auth.GetAllUsageBySource(db, "month", "", "")
Expect(err).ToNot(HaveOccurred())
Expect(truncated).To(BeTrue())
Expect(totals.ByKey).To(HaveLen(200))
Expect(totals.ByKey[0].Tokens > totals.ByKey[199].Tokens).To(BeTrue())
})
// insertNamed records a row with explicit user_id, user_name, source,
// and optional api key snapshot. Used by the user-attribution tests
// below which the older insert helper can't express.
insertNamed := func(db *gorm.DB, userID, userName, source, keyID, keyName string, tokens int64) {
rec := &auth.UsageRecord{
UserID: userID,
UserName: userName,
Source: source,
Model: "gpt-4",
TotalTokens: tokens,
CreatedAt: time.Now(),
}
if keyID != "" {
rec.APIKeyID = &keyID
rec.APIKeyName = keyName
}
Expect(auth.RecordUsage(db, rec)).To(Succeed())
}
It("attributes each KeyTotal to its owner user", func() {
db := testDB()
insertNamed(db, "alice", "Alice", auth.UsageSourceAPIKey, "k1", "ci-runner", 100)
insertNamed(db, "bob", "Bob", auth.UsageSourceAPIKey, "k2", "lap", 50)
_, totals, _, err := auth.GetAllUsageBySource(db, "month", "", "")
Expect(err).ToNot(HaveOccurred())
Expect(totals.ByKey).To(HaveLen(2))
byID := map[string]auth.KeyTotal{}
for _, k := range totals.ByKey {
byID[k.APIKeyID] = k
}
Expect(byID["k1"].UserID).To(Equal("alice"))
Expect(byID["k1"].UserName).To(Equal("Alice"))
Expect(byID["k2"].UserID).To(Equal("bob"))
Expect(byID["k2"].UserName).To(Equal("Bob"))
})
It("breaks Web UI and legacy traffic out per user in by_user_source for admin", func() {
db := testDB()
// Alice and Bob both have Web UI traffic; a synthetic legacy user
// also contributes. ByUserSource should expose one row per
// (source, user) pair, never for source=apikey.
insertNamed(db, "alice", "Alice", auth.UsageSourceWeb, "", "", 30)
insertNamed(db, "bob", "Bob", auth.UsageSourceWeb, "", "", 70)
insertNamed(db, "legacy-api-key", "API Key User", auth.UsageSourceLegacy, "", "", 10)
insertNamed(db, "alice", "Alice", auth.UsageSourceAPIKey, "k1", "ci-runner", 5)
_, totals, _, err := auth.GetAllUsageBySource(db, "month", "", "")
Expect(err).ToNot(HaveOccurred())
Expect(totals.ByUserSource).ToNot(BeEmpty())
for _, r := range totals.ByUserSource {
Expect(r.Source).ToNot(Equal(auth.UsageSourceAPIKey))
}
webByUser := map[string]int64{}
legacyByUser := map[string]int64{}
for _, r := range totals.ByUserSource {
switch r.Source {
case auth.UsageSourceWeb:
webByUser[r.UserID] = r.Tokens
case auth.UsageSourceLegacy:
legacyByUser[r.UserID] = r.Tokens
}
}
Expect(webByUser["alice"]).To(Equal(int64(30)))
Expect(webByUser["bob"]).To(Equal(int64(70)))
Expect(legacyByUser["legacy-api-key"]).To(Equal(int64(10)))
})
It("does NOT populate by_user_source in the non-admin path", func() {
db := testDB()
insertNamed(db, "alice", "Alice", auth.UsageSourceWeb, "", "", 30)
_, totals, err := auth.GetUserUsageBySource(db, "alice", "month")
Expect(err).ToNot(HaveOccurred())
// Non-admin path uses includeLegacy=false, so by_user_source stays nil.
Expect(totals.ByUserSource).To(BeNil())
})
})
})

@@ -16,8 +16,11 @@ import (
"github.com/google/uuid"
"github.com/gorilla/websocket"
"github.com/labstack/echo/v4"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/gallery"
"github.com/mudler/LocalAI/core/http/auth"
"github.com/mudler/LocalAI/core/schema"
"github.com/mudler/LocalAI/core/services/galleryop"
"github.com/mudler/LocalAI/core/services/nodes"
"github.com/mudler/xlog"
"gorm.io/gorm"
@@ -381,14 +384,24 @@ func ResumeNodeEndpoint(registry *nodes.NodeRegistry) echo.HandlerFunc {
}
}
// InstallBackendOnNodeEndpoint triggers backend installation on a worker node via NATS.
// InstallBackendOnNodeEndpoint triggers backend installation on a worker node.
// Async: enqueues a ManagementOp on the gallery service channel and returns a
// jobID immediately. The gallery service worker goroutine drives the actual
// install via DistributedBackendManager.InstallBackend, which honors the op's
// TargetNodeID to scope the fan-out to one node. The UI polls /api/backends/job/:uid
// for progress, mirroring /api/backends/install/:id.
//
// Backend can be either a gallery ID (resolved against BackendGalleries) or a
// direct URI install (URI + Name + optional Alias) same shape as the
// direct URI install (URI + Name + optional Alias) - same shape as the
// standalone /api/backends/install-external path, just scoped to one node.
func InstallBackendOnNodeEndpoint(unloader nodes.NodeCommandSender) echo.HandlerFunc {
//
// The legacy unloader argument is retained for signature symmetry with
// DeleteBackendOnNodeEndpoint / ListBackendsOnNodeEndpoint but is no longer
// used here - the async path goes through galleryService.
func InstallBackendOnNodeEndpoint(_ nodes.NodeCommandSender, galleryService *galleryop.GalleryService, opcache *galleryop.OpCache, appConfig *config.ApplicationConfig) echo.HandlerFunc {
return func(c echo.Context) error {
if unloader == nil {
return c.JSON(http.StatusServiceUnavailable, nodeError(http.StatusServiceUnavailable, "NATS not configured"))
if galleryService == nil {
return c.JSON(http.StatusServiceUnavailable, nodeError(http.StatusServiceUnavailable, "gallery service not configured"))
}
nodeID := c.Param("id")
var req struct {
@@ -401,25 +414,65 @@ func InstallBackendOnNodeEndpoint(unloader nodes.NodeCommandSender) echo.Handler
if err := c.Bind(&req); err != nil {
return c.JSON(http.StatusBadRequest, nodeError(http.StatusBadRequest, "invalid request body"))
}
// Either a gallery backend name or a direct URI must be supplied.
if req.Backend == "" && req.URI == "" {
return c.JSON(http.StatusBadRequest, nodeError(http.StatusBadRequest, "backend name or uri required"))
}
// Admin-driven backend install: not tied to a specific replica slot
// (no model is being loaded). Pass replica 0 to match the worker's
// admin process-key convention (`backend#0`). The worker's fast path
// takes over if the backend is already running — upgrades go through
// the dedicated /api/backends/upgrade path on backend.upgrade.
reply, err := unloader.InstallBackend(nodeID, req.Backend, "", req.BackendGalleries, req.URI, req.Name, req.Alias, 0)
jobUUID, err := uuid.NewUUID()
if err != nil {
xlog.Error("Failed to install backend on node", "node", nodeID, "backend", req.Backend, "uri", req.URI, "error", err)
return c.JSON(http.StatusInternalServerError, nodeError(http.StatusInternalServerError, "failed to install backend on node"))
return c.JSON(http.StatusInternalServerError, nodeError(http.StatusInternalServerError, "failed to generate job id"))
}
if !reply.Success {
xlog.Error("Backend install failed on node", "node", nodeID, "backend", req.Backend, "uri", req.URI, "error", reply.Error)
return c.JSON(http.StatusInternalServerError, nodeError(http.StatusInternalServerError, "backend installation failed"))
jobID := jobUUID.String()
// Cache key: for gallery installs, use the backend slug; for URI
// installs prefer the provided Name (falling back to URI). All keys
// are node-scoped so concurrent installs of the same backend on
// different nodes do not stomp each other in opcache.
backendKey := req.Backend
if backendKey == "" {
backendKey = req.Name
if backendKey == "" {
backendKey = req.URI
}
}
return c.JSON(http.StatusOK, map[string]string{"message": "backend installed"})
cacheKey := galleryop.NodeScopedKey(nodeID, backendKey)
opcache.SetBackend(cacheKey, jobID)
// Optional caller-supplied galleries override. Mirrors the standalone
// install path so an admin can point at a private gallery.
galleries := appConfig.BackendGalleries
if req.BackendGalleries != "" {
var custom []config.Gallery
if err := json.Unmarshal([]byte(req.BackendGalleries), &custom); err != nil {
xlog.Warn("Ignoring malformed backend_galleries override; falling back to configured galleries", "error", err, "nodeID", nodeID)
} else if len(custom) > 0 {
galleries = custom
}
}
ctx, cancelFunc := context.WithCancel(context.Background())
op := galleryop.ManagementOp[gallery.GalleryBackend, any]{
ID: jobID,
GalleryElementName: req.Backend,
Galleries: galleries,
TargetNodeID: nodeID,
ExternalURI: req.URI,
ExternalName: req.Name,
ExternalAlias: req.Alias,
Context: ctx,
CancelFunc: cancelFunc,
}
galleryService.StoreCancellation(jobID, cancelFunc)
go func() {
galleryService.BackendGalleryChannel <- op
}()
xlog.Info("Node-scoped backend install dispatched", "node", nodeID, "backend", req.Backend, "uri", req.URI, "jobID", jobID)
return c.JSON(http.StatusAccepted, map[string]string{
"jobID": jobID,
"statusUrl": "/api/backends/job/" + jobID,
"message": "backend installation started",
})
}
}
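
The cache-key fallback in the handler above (backend slug first, then the supplied name, then the URI) can be sketched in isolation. This is a minimal illustration, not the project's code: `pickBackendKey` and `nodeScopedKeySketch` are hypothetical names, and the `nodeID + ":" + key` format is an assumption, since the real `galleryop.NodeScopedKey` implementation is not shown in this diff.

```go
package main

// pickBackendKey mirrors the fallback order used by the handler:
// gallery backend slug, then the caller-supplied name, then the raw URI.
func pickBackendKey(backend, name, uri string) string {
	if backend != "" {
		return backend
	}
	if name != "" {
		return name
	}
	return uri
}

// nodeScopedKeySketch is a hypothetical stand-in for galleryop.NodeScopedKey.
// The separator is an assumption; the only property relied on here is that
// the same backend on two different nodes yields two different keys.
func nodeScopedKeySketch(nodeID, key string) string {
	return nodeID + ":" + key
}
```

Under these assumptions, `nodeScopedKeySketch("node-a", pickBackendKey("", "custom", "oci://example.com/custom-backend:v1"))` keys the job by name rather than URI, and the same request against `node-b` gets a distinct entry.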


@@ -0,0 +1,123 @@
package localai_test
import (
"bytes"
"encoding/json"
"net/http"
"net/http/httptest"
"github.com/labstack/echo/v4"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/gallery"
"github.com/mudler/LocalAI/core/http/endpoints/localai"
"github.com/mudler/LocalAI/core/services/galleryop"
)
// InstallBackendOnNodeEndpoint became async to stop blocking the browser on
// the 3-minute NATS reply timeout. These specs lock in the new contract:
// HTTP 202 with a jobID, a ManagementOp enqueued on the gallery channel, and
// an opcache entry keyed by NodeScopedKey so concurrent installs of the same
// backend on different nodes do not stomp each other.
var _ = Describe("InstallBackendOnNodeEndpoint async behavior", func() {
var (
e *echo.Echo
galleryService *galleryop.GalleryService
opcache *galleryop.OpCache
appCfg *config.ApplicationConfig
dispatched chan galleryop.ManagementOp[gallery.GalleryBackend, any]
done chan struct{}
drainExited chan struct{}
)
BeforeEach(func() {
e = echo.New()
appCfg = &config.ApplicationConfig{
BackendGalleries: []config.Gallery{{Name: "test-gallery", URL: "http://example.com"}},
}
galleryService = galleryop.NewGalleryService(appCfg, nil)
opcache = galleryop.NewOpCache(galleryService)
// Drain the gallery channel into a buffered side channel so the
// handler's `go func() { ch <- op }()` send does not block waiting
// for the real worker (which is not running in this unit test).
dispatched = make(chan galleryop.ManagementOp[gallery.GalleryBackend, any], 4)
done = make(chan struct{})
drainExited = make(chan struct{})
go func() {
defer close(drainExited)
for {
select {
case op := <-galleryService.BackendGalleryChannel:
dispatched <- op
case <-done:
return
}
}
}()
})
AfterEach(func() {
// Signal the drain goroutine to exit. We do NOT close
// BackendGalleryChannel: the handler's dispatch goroutine may still
// be pending (specs that don't Eventually-Receive), and a send on a
// closed channel panics. Signalling via `done` lets the drain
// goroutine return without touching the gallery channel.
close(done)
Eventually(drainExited, "2s").Should(BeClosed())
})
It("returns 202 with a jobID and dispatches a TargetNodeID-scoped op", func() {
body := `{"backend": "llama-cpp"}`
req := httptest.NewRequest(http.MethodPost, "/api/nodes/node-xyz/backends/install", bytes.NewBufferString(body))
req.Header.Set("Content-Type", "application/json")
rec := httptest.NewRecorder()
c := e.NewContext(req, rec)
c.SetParamNames("id")
c.SetParamValues("node-xyz")
handler := localai.InstallBackendOnNodeEndpoint(nil, galleryService, opcache, appCfg)
Expect(handler(c)).To(Succeed())
Expect(rec.Code).To(Equal(http.StatusAccepted))
var resp map[string]any
Expect(json.Unmarshal(rec.Body.Bytes(), &resp)).To(Succeed())
Expect(resp["jobID"]).To(BeAssignableToTypeOf(""))
Expect(resp["jobID"].(string)).ToNot(BeEmpty())
Expect(resp["message"]).To(Equal("backend installation started"))
Eventually(dispatched, "2s").Should(Receive())
Expect(opcache.Exists(galleryop.NodeScopedKey("node-xyz", "llama-cpp"))).To(BeTrue())
Expect(opcache.IsBackendOp(galleryop.NodeScopedKey("node-xyz", "llama-cpp"))).To(BeTrue())
})
It("returns 400 when neither backend nor uri is supplied", func() {
req := httptest.NewRequest(http.MethodPost, "/api/nodes/node-xyz/backends/install", bytes.NewBufferString(`{}`))
req.Header.Set("Content-Type", "application/json")
rec := httptest.NewRecorder()
c := e.NewContext(req, rec)
c.SetParamNames("id")
c.SetParamValues("node-xyz")
handler := localai.InstallBackendOnNodeEndpoint(nil, galleryService, opcache, appCfg)
Expect(handler(c)).To(Succeed())
Expect(rec.Code).To(Equal(http.StatusBadRequest))
})
It("accepts a direct URI install and uses the name as the cache key", func() {
body := `{"uri": "oci://example.com/custom-backend:v1", "name": "custom"}`
req := httptest.NewRequest(http.MethodPost, "/api/nodes/node-xyz/backends/install", bytes.NewBufferString(body))
req.Header.Set("Content-Type", "application/json")
rec := httptest.NewRecorder()
c := e.NewContext(req, rec)
c.SetParamNames("id")
c.SetParamValues("node-xyz")
handler := localai.InstallBackendOnNodeEndpoint(nil, galleryService, opcache, appCfg)
Expect(handler(c)).To(Succeed())
Expect(rec.Code).To(Equal(http.StatusAccepted))
Expect(opcache.Exists(galleryop.NodeScopedKey("node-xyz", "custom"))).To(BeTrue())
})
})
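
The drain-goroutine pattern used in the spec's BeforeEach/AfterEach can be sketched on its own. This is a simplified illustration with `int` values instead of `ManagementOp`; `startDrain` is a hypothetical helper name, not something defined in the test file.

```go
package main

// startDrain forwards values from src to sink until done is closed.
// It never closes src: a producer may still be about to send on it, and a
// send on a closed channel panics. The returned channel is closed once the
// goroutine has exited, so callers can wait for a clean shutdown.
func startDrain(src <-chan int, sink chan<- int, done <-chan struct{}) <-chan struct{} {
	exited := make(chan struct{})
	go func() {
		defer close(exited)
		for {
			select {
			case v := <-src:
				sink <- v
			case <-done:
				return
			}
		}
	}()
	return exited
}
```

As in the spec, the sink should be buffered generously enough that the forwarder never blocks on it; a full sink would stop the goroutine from noticing `done`.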


@@ -73,363 +73,6 @@ func mergeToolCallDeltas(existing []schema.ToolCall, deltas []schema.ToolCall) [
// @Success 200 {object} schema.OpenAIResponse "Response"
// @Router /v1/chat/completions [post]
func ChatEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator *templates.Evaluator, startupOptions *config.ApplicationConfig, natsClient mcpTools.MCPNATSClient, assistantHolder *mcpTools.LocalAIAssistantHolder) echo.HandlerFunc {
process := func(s string, req *schema.OpenAIRequest, config *config.ModelConfig, loader *model.ModelLoader, responses chan schema.OpenAIResponse, extraUsage bool, id string, created int) error {
initialMessage := schema.OpenAIResponse{
ID: id,
Created: created,
Model: req.Model, // we have to return what the user sent here, due to OpenAI spec.
Choices: []schema.Choice{{Delta: &schema.Message{Role: "assistant"}, Index: 0, FinishReason: nil}},
Object: "chat.completion.chunk",
}
responses <- initialMessage
// Detect if thinking token is already in prompt or template
// When UseTokenizerTemplate is enabled, predInput is empty, so we check the template
var template string
if config.TemplateConfig.UseTokenizerTemplate {
template = config.GetModelTemplate()
} else {
template = s
}
thinkingStartToken := reason.DetectThinkingStartToken(template, &config.ReasoningConfig)
extractor := reason.NewReasoningExtractor(thinkingStartToken, config.ReasoningConfig)
_, _, _, err := ComputeChoices(req, s, config, cl, startupOptions, loader, func(s string, c *[]schema.Choice) {}, func(s string, tokenUsage backend.TokenUsage) bool {
var reasoningDelta, contentDelta string
// Always keep the Go-side extractor in sync with raw tokens so it
// can serve as fallback for backends without an autoparser (e.g. vLLM).
goReasoning, goContent := extractor.ProcessToken(s)
// When C++ autoparser chat deltas are available, prefer them — they
// handle model-specific formats (Gemma 4, etc.) without Go-side tags.
// Otherwise fall back to Go-side extraction.
if tokenUsage.HasChatDeltaContent() {
rawReasoning, cd := tokenUsage.ChatDeltaReasoningAndContent()
contentDelta = cd
reasoningDelta = extractor.ProcessChatDeltaReasoning(rawReasoning)
} else {
reasoningDelta = goReasoning
contentDelta = goContent
}
usage := schema.OpenAIUsage{
PromptTokens: tokenUsage.Prompt,
CompletionTokens: tokenUsage.Completion,
TotalTokens: tokenUsage.Prompt + tokenUsage.Completion,
}
if extraUsage {
usage.TimingTokenGeneration = tokenUsage.TimingTokenGeneration
usage.TimingPromptProcessing = tokenUsage.TimingPromptProcessing
}
delta := &schema.Message{}
if contentDelta != "" {
delta.Content = &contentDelta
}
if reasoningDelta != "" {
delta.Reasoning = &reasoningDelta
}
// Usage rides as a struct field for the consumer to track the
// running cumulative — it is stripped before JSON marshal so the
// wire chunk stays spec-compliant (no `usage` on intermediate
// chunks). The dedicated trailer chunk (when include_usage=true)
// carries the final totals.
usageForChunk := usage
resp := schema.OpenAIResponse{
ID: id,
Created: created,
Model: req.Model, // we have to return what the user sent here, due to OpenAI spec.
Choices: []schema.Choice{{Delta: delta, Index: 0, FinishReason: nil}},
Object: "chat.completion.chunk",
Usage: &usageForChunk,
}
responses <- resp
return true
})
close(responses)
return err
}
processTools := func(noAction string, prompt string, req *schema.OpenAIRequest, config *config.ModelConfig, loader *model.ModelLoader, responses chan schema.OpenAIResponse, extraUsage bool, id string, created int, textContentToReturn *string) error {
// Detect if thinking token is already in prompt or template
var template string
if config.TemplateConfig.UseTokenizerTemplate {
template = config.GetModelTemplate()
} else {
template = prompt
}
thinkingStartToken := reason.DetectThinkingStartToken(template, &config.ReasoningConfig)
extractor := reason.NewReasoningExtractor(thinkingStartToken, config.ReasoningConfig)
result := ""
lastEmittedCount := 0
sentInitialRole := false
sentReasoning := false
hasChatDeltaToolCalls := false
hasChatDeltaContent := false
_, _, chatDeltas, err := ComputeChoices(req, prompt, config, cl, startupOptions, loader, func(s string, c *[]schema.Choice) {}, func(s string, usage backend.TokenUsage) bool {
result += s
// Track whether ChatDeltas from the C++ autoparser contain
// tool calls or content, so the retry decision can account for them.
for _, d := range usage.ChatDeltas {
if len(d.ToolCalls) > 0 {
hasChatDeltaToolCalls = true
}
if d.Content != "" {
hasChatDeltaContent = true
}
}
var reasoningDelta, contentDelta string
goReasoning, goContent := extractor.ProcessToken(s)
if usage.HasChatDeltaContent() {
rawReasoning, cd := usage.ChatDeltaReasoningAndContent()
contentDelta = cd
reasoningDelta = extractor.ProcessChatDeltaReasoning(rawReasoning)
} else {
reasoningDelta = goReasoning
contentDelta = goContent
}
// Emit reasoning deltas in their own SSE chunks before any tool-call chunks
// (OpenAI spec: reasoning and tool_calls never share a delta)
if reasoningDelta != "" {
responses <- schema.OpenAIResponse{
ID: id,
Created: created,
Model: req.Model,
Choices: []schema.Choice{{
Delta: &schema.Message{Reasoning: &reasoningDelta},
Index: 0,
}},
Object: "chat.completion.chunk",
}
sentReasoning = true
}
// Stream content deltas (cleaned of reasoning tags) while no tool calls
// have been detected. Once the incremental parser finds tool calls,
// content stops — per OpenAI spec, content and tool_calls don't mix.
if lastEmittedCount == 0 && contentDelta != "" {
if !sentInitialRole {
responses <- schema.OpenAIResponse{
ID: id, Created: created, Model: req.Model,
Choices: []schema.Choice{{Delta: &schema.Message{Role: "assistant"}, Index: 0}},
Object: "chat.completion.chunk",
}
sentInitialRole = true
}
responses <- schema.OpenAIResponse{
ID: id, Created: created, Model: req.Model,
Choices: []schema.Choice{{
Delta: &schema.Message{Content: &contentDelta},
Index: 0,
}},
Object: "chat.completion.chunk",
}
}
// Try incremental XML parsing for streaming support using iterative parser
// This allows emitting partial tool calls as they're being generated
cleanedResult := functions.CleanupLLMResult(result, config.FunctionsConfig)
// Determine XML format from config
var xmlFormat *functions.XMLToolCallFormat
if config.FunctionsConfig.XMLFormat != nil {
xmlFormat = config.FunctionsConfig.XMLFormat
} else if config.FunctionsConfig.XMLFormatPreset != "" {
xmlFormat = functions.GetXMLFormatPreset(config.FunctionsConfig.XMLFormatPreset)
}
// Use iterative parser for streaming (partial parsing enabled)
// Try XML parsing first
partialResults, parseErr := functions.ParseXMLIterative(cleanedResult, xmlFormat, true)
if parseErr == nil && len(partialResults) > 0 {
// Emit new XML tool calls that weren't emitted before
if len(partialResults) > lastEmittedCount {
for i := lastEmittedCount; i < len(partialResults); i++ {
toolCall := partialResults[i]
initialMessage := schema.OpenAIResponse{
ID: id,
Created: created,
Model: req.Model,
Choices: []schema.Choice{{
Delta: &schema.Message{
Role: "assistant",
ToolCalls: []schema.ToolCall{
{
Index: i,
ID: id,
Type: "function",
FunctionCall: schema.FunctionCall{
Name: toolCall.Name,
},
},
},
},
Index: 0,
FinishReason: nil,
}},
Object: "chat.completion.chunk",
}
select {
case responses <- initialMessage:
default:
}
}
lastEmittedCount = len(partialResults)
}
} else {
// Try JSON tool call parsing for streaming.
// Only emit NEW tool calls (same guard as XML parser above).
jsonResults, jsonErr := functions.ParseJSONIterative(cleanedResult, true)
if jsonErr == nil && len(jsonResults) > lastEmittedCount {
for i := lastEmittedCount; i < len(jsonResults); i++ {
jsonObj := jsonResults[i]
name, ok := jsonObj["name"].(string)
if !ok || name == "" {
continue
}
args := "{}"
if argsVal, ok := jsonObj["arguments"]; ok {
if argsStr, ok := argsVal.(string); ok {
args = argsStr
} else {
argsBytes, _ := json.Marshal(argsVal)
args = string(argsBytes)
}
}
initialMessage := schema.OpenAIResponse{
ID: id,
Created: created,
Model: req.Model,
Choices: []schema.Choice{{
Delta: &schema.Message{
Role: "assistant",
ToolCalls: []schema.ToolCall{
{
Index: i,
ID: id,
Type: "function",
FunctionCall: schema.FunctionCall{
Name: name,
Arguments: args,
},
},
},
},
Index: 0,
FinishReason: nil,
}},
Object: "chat.completion.chunk",
}
responses <- initialMessage
}
lastEmittedCount = len(jsonResults)
}
}
return true
},
func(attempt int) bool {
// After streaming completes: check if we got actionable content
cleaned := extractor.CleanedContent()
// Check for tool calls from chat deltas (will be re-checked after ComputeChoices,
// but we need to know here whether to retry).
// Also check ChatDelta flags — when the C++ autoparser is active,
// tool calls and content are delivered via ChatDeltas while the
// raw message is cleared. Without this check, we'd retry
// unnecessarily, losing valid results and concatenating output.
hasToolCalls := lastEmittedCount > 0 || hasChatDeltaToolCalls
hasContent := cleaned != "" || hasChatDeltaContent
if !hasContent && !hasToolCalls {
xlog.Warn("Streaming: backend produced only reasoning, retrying",
"reasoning_len", len(extractor.Reasoning()), "attempt", attempt+1)
extractor.ResetAndSuppressReasoning()
result = ""
lastEmittedCount = 0
sentInitialRole = false
hasChatDeltaToolCalls = false
hasChatDeltaContent = false
return true
}
return false
},
)
if err != nil {
return err
}
// Try using pre-parsed tool calls from C++ autoparser (chat deltas)
var functionResults []functions.FuncCallResults
var reasoning string
if deltaToolCalls := functions.ToolCallsFromChatDeltas(chatDeltas); len(deltaToolCalls) > 0 {
xlog.Debug("[ChatDeltas] Using pre-parsed tool calls from C++ autoparser", "count", len(deltaToolCalls))
functionResults = deltaToolCalls
// Use content/reasoning from deltas too
*textContentToReturn = functions.ContentFromChatDeltas(chatDeltas)
reasoning = functions.ReasoningFromChatDeltas(chatDeltas)
} else {
// Fallback: parse tool calls from raw text (no chat deltas from backend)
xlog.Debug("[ChatDeltas] no pre-parsed tool calls, falling back to Go-side text parsing")
reasoning = extractor.Reasoning()
cleanedResult := extractor.CleanedContent()
*textContentToReturn = functions.ParseTextContent(cleanedResult, config.FunctionsConfig)
cleanedResult = functions.CleanupLLMResult(cleanedResult, config.FunctionsConfig)
functionResults = functions.ParseFunctionCall(cleanedResult, config.FunctionsConfig)
}
xlog.Debug("[ChatDeltas] final tool call decision", "tool_calls", len(functionResults), "text_content", *textContentToReturn)
// noAction is a sentinel "just answer" pseudo-function — not a real
// tool call. Scan the whole slice rather than only index 0 so we
// don't drop a real tool call that happens to follow a noAction
// entry, and so the default branch isn't entered with only noAction
// entries to emit as tool_calls.
noActionToRun := !hasRealCall(functionResults, noAction)
switch {
case noActionToRun:
// Token-cumulative usage is communicated to the streaming
// consumer via the per-token callback's chunk struct (stripped
// before wire marshal). The final usage trailer — when the
// caller opted in with stream_options.include_usage — is built
// by the outer streaming loop, not here.
var result string
if !sentInitialRole {
var hqErr error
result, hqErr = handleQuestion(config, functionResults, extractor.CleanedContent(), prompt)
if hqErr != nil {
xlog.Error("error handling question", "error", hqErr)
return hqErr
}
}
for _, chunk := range buildNoActionFinalChunks(
id, req.Model, created,
sentInitialRole, sentReasoning,
result, reasoning,
) {
responses <- chunk
}
default:
for _, chunk := range buildDeferredToolCallChunks(
id, req.Model, created,
functionResults, lastEmittedCount,
sentInitialRole, *textContentToReturn,
sentReasoning, reasoning,
) {
responses <- chunk
}
}
close(responses)
return err
}
return func(c echo.Context) error {
var textContentToReturn string
id := uuid.New().String()
@@ -697,17 +340,19 @@ func ChatEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator
}
responses := make(chan schema.OpenAIResponse)
ended := make(chan error, 1)
ended := make(chan streamWorkerResult, 1)
go func() {
if !shouldUseFn {
ended <- process(predInput, input, config, ml, responses, extraUsage, id, created)
u, err := processStream(predInput, input, config, cl, startupOptions, ml, responses, id, created)
ended <- streamWorkerResult{usage: u, err: err}
} else {
ended <- processTools(noActionName, predInput, input, config, ml, responses, extraUsage, id, created, &textContentToReturn)
u, err := processStreamWithTools(noActionName, predInput, input, config, cl, startupOptions, ml, responses, id, created, &textContentToReturn)
ended <- streamWorkerResult{usage: u, err: err}
}
}()
usage := &schema.OpenAIUsage{}
var finalUsage backend.TokenUsage
toolsCalled := false
var collectedToolCalls []schema.ToolCall
var collectedContent string
@@ -725,13 +370,6 @@ func ChatEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator
xlog.Debug("No choices in the response, skipping")
continue
}
// Capture the running cumulative usage from this chunk
// (when present) so the include_usage trailer can carry
// the final totals. Usage is stripped before marshal
// below so the wire chunk stays spec-compliant.
if ev.Usage != nil {
usage = ev.Usage
}
if len(ev.Choices[0].Delta.ToolCalls) > 0 {
toolsCalled = true
// Collect and merge tool call deltas for MCP execution
@@ -747,11 +385,6 @@ func ChatEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator
collectedContent += *sp
}
}
// OpenAI streaming spec: intermediate chunks must NOT
// carry a `usage` field. Strip the tracking copy
// before marshalling — usage is delivered via the
// dedicated trailer chunk when include_usage=true.
ev.Usage = nil
respData, err := json.Marshal(ev)
if err != nil {
xlog.Debug("Failed to marshal response", "error", err)
@@ -766,15 +399,16 @@ func ChatEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator
return err
}
c.Response().Flush()
case err := <-ended:
if err == nil {
case res := <-ended:
if res.err == nil {
finalUsage = res.usage
break LOOP
}
xlog.Error("Stream ended with error", "error", err)
xlog.Error("Stream ended with error", "error", res.err)
errorResp := schema.ErrorResponse{
Error: &schema.APIError{
Message: err.Error(),
Message: res.err.Error(),
Type: "server_error",
Code: "server_error",
},
@@ -797,7 +431,10 @@ func ChatEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator
// still trying to send (e.g., after client disconnect). The goroutine
// calls close(responses) when done, which terminates the drain.
if input.Context.Err() != nil {
go func() { for range responses {} }()
go func() {
for range responses {
}
}()
<-ended
}
@@ -921,8 +558,16 @@ func ChatEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator
// Trailing usage chunk per OpenAI spec: emit only when the
// caller opted in via stream_options.include_usage. Shape:
// {"choices":[],"usage":{...},"object":"chat.completion.chunk",...}
if input.StreamOptions != nil && input.StreamOptions.IncludeUsage && usage != nil {
trailer := streamUsageTrailerJSON(id, input.Model, created, *usage)
//
// finalUsage is the authoritative TokenUsage returned by the
// worker function (processStream / processStreamWithTools) via the `ended`
// channel. The worker reads it from ComputeChoices' return
// value, which is the cumulative count produced by the backend
// over the whole prediction. Issue #9927 was caused by the
// tools-path worker not surfacing this value at all.
if input.StreamOptions != nil && input.StreamOptions.IncludeUsage {
trailerUsage := streamUsageFromTokenUsage(finalUsage, extraUsage)
trailer := streamUsageTrailerJSON(id, input.Model, created, trailerUsage)
_, _ = fmt.Fprintf(c.Response().Writer, "data: %s\n\n", trailer)
}
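
The hand-off introduced in this hunk (chunks on one channel, the final usage and error on a separate buffered result channel) can be sketched in a reduced form. All names below (`tokenUsageSketch`, `workerResultSketch`, `streamWorker`, `consumeStream`) are hypothetical, and strings stand in for SSE chunks; the sketch only assumes, as the real workers do, that the worker closes the chunk channel before returning.

```go
package main

// tokenUsageSketch stands in for backend.TokenUsage.
type tokenUsageSketch struct{ Prompt, Completion int }

// workerResultSketch plays the role of streamWorkerResult.
type workerResultSketch struct {
	usage tokenUsageSketch
	err   error
}

// streamWorker pushes one chunk per token, closes the chunk channel, and
// returns the cumulative usage as a plain value instead of embedding it in
// the chunks.
func streamWorker(promptTokens int, tokens []string, responses chan<- string) (tokenUsageSketch, error) {
	defer close(responses)
	for _, t := range tokens {
		responses <- t
	}
	return tokenUsageSketch{Prompt: promptTokens, Completion: len(tokens)}, nil
}

// consumeStream reads chunks until the worker reports its result.
func consumeStream(promptTokens int, tokens []string) ([]string, tokenUsageSketch, error) {
	responses := make(chan string)            // unbuffered, as in the handler
	ended := make(chan workerResultSketch, 1) // buffered so the worker never blocks on it
	go func() {
		u, err := streamWorker(promptTokens, tokens, responses)
		ended <- workerResultSketch{usage: u, err: err}
	}()
	var chunks []string
	var in <-chan string = responses
	for {
		select {
		case c, ok := <-in:
			if !ok {
				in = nil // channel closed: stop selecting on it
				continue
			}
			chunks = append(chunks, c)
		case res := <-ended:
			return chunks, res.usage, res.err
		}
	}
}
```

Because the chunk channel is unbuffered, every chunk has been received by the time the result arrives, so the usage value can be read after the loop without inspecting any chunk.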


@@ -4,10 +4,39 @@ import (
"encoding/json"
"fmt"
"github.com/mudler/LocalAI/core/backend"
"github.com/mudler/LocalAI/core/schema"
"github.com/mudler/LocalAI/pkg/functions"
)
// streamWorkerResult is what the streaming workers (processStream / processStreamWithTools)
// hand back to the outer ChatEndpoint loop through the `ended` channel.
// Threading the final TokenUsage here, instead of piggy-backing it on the
// `responses` SSE channel, keeps the SSE channel single-purpose (wire chunks)
// and gives the trailer emitter a plain Go value to read after LOOP exits.
// Fix for issue #9927: the previous tools-path worker never surfaced the
// cumulative token counts at all, so the include_usage trailer reported zeros.
type streamWorkerResult struct {
usage backend.TokenUsage
err error
}
// streamUsageFromTokenUsage converts the backend's cumulative TokenUsage into
// the OpenAI-spec OpenAIUsage shape used on the wire. `extraUsage` controls
// whether the non-standard timing fields are forwarded.
func streamUsageFromTokenUsage(usage backend.TokenUsage, extraUsage bool) schema.OpenAIUsage {
out := schema.OpenAIUsage{
PromptTokens: usage.Prompt,
CompletionTokens: usage.Completion,
TotalTokens: usage.Prompt + usage.Completion,
}
if extraUsage {
out.TimingTokenGeneration = usage.TimingTokenGeneration
out.TimingPromptProcessing = usage.TimingPromptProcessing
}
return out
}
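
For orientation, the converted usage ends up in a trailing chunk with an empty `choices` array. The following is a rough sketch of that shape only: `usageSketch`, `trailerSketch` and `buildTrailerSketch` are hypothetical names, the field set is an assumption based on the OpenAI streaming format, and this is not the body of `streamUsageTrailerJSON`.

```go
package main

import "encoding/json"

// usageSketch carries the three standard OpenAI usage counters.
type usageSketch struct {
	PromptTokens     int `json:"prompt_tokens"`
	CompletionTokens int `json:"completion_tokens"`
	TotalTokens      int `json:"total_tokens"`
}

// trailerSketch is an assumed shape for the include_usage trailer chunk.
type trailerSketch struct {
	ID      string      `json:"id"`
	Object  string      `json:"object"`
	Created int         `json:"created"`
	Model   string      `json:"model"`
	Choices []struct{}  `json:"choices"`
	Usage   usageSketch `json:"usage"`
}

// buildTrailerSketch marshals a trailer with an empty (non-nil) choices
// slice so the field serializes as [] rather than null.
func buildTrailerSketch(id, model string, created, prompt, completion int) ([]byte, error) {
	return json.Marshal(trailerSketch{
		ID:      id,
		Object:  "chat.completion.chunk",
		Created: created,
		Model:   model,
		Choices: []struct{}{},
		Usage: usageSketch{
			PromptTokens:     prompt,
			CompletionTokens: completion,
			TotalTokens:      prompt + completion,
		},
	})
}
```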
// streamUsageTrailerJSON returns the bytes of the OpenAI-spec trailing usage
// chunk emitted in streaming completions when the request opts in via
// `stream_options.include_usage: true`. The shape is:


@@ -1,10 +1,14 @@
package openai
import (
"context"
"encoding/json"
"github.com/mudler/LocalAI/core/backend"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/schema"
"github.com/mudler/LocalAI/pkg/functions"
"github.com/mudler/LocalAI/pkg/model"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
@@ -152,6 +156,28 @@ var _ = Describe("streaming usage spec compliance", func() {
})
})
Describe("streamUsageFromTokenUsage", func() {
It("converts backend TokenUsage to schema OpenAIUsage", func() {
tu := backend.TokenUsage{Prompt: 18, Completion: 213}
u := streamUsageFromTokenUsage(tu, false)
Expect(u.PromptTokens).To(Equal(18))
Expect(u.CompletionTokens).To(Equal(213))
Expect(u.TotalTokens).To(Equal(231))
Expect(u.TimingTokenGeneration).To(BeZero())
Expect(u.TimingPromptProcessing).To(BeZero())
})
It("includes timings when extraUsage is true", func() {
tu := backend.TokenUsage{
Prompt: 10, Completion: 20,
TimingPromptProcessing: 0.5,
TimingTokenGeneration: 1.5,
}
u := streamUsageFromTokenUsage(tu, true)
Expect(u.TimingPromptProcessing).To(Equal(0.5))
Expect(u.TimingTokenGeneration).To(Equal(1.5))
})
})
Describe("OpenAIRequest.StreamOptions", func() {
It("parses stream_options.include_usage=true", func() {
body := []byte(`{
@@ -177,3 +203,160 @@ var _ = Describe("streaming usage spec compliance", func() {
})
})
})
// Functional regression coverage for issue #9927: the streaming workers
// must surface the cumulative TokenUsage returned by ComputeChoices to
// their caller. The earlier broken implementation discarded that value
// (`_, _, chatDeltas, err := ComputeChoices(...)`), dropping the counts
// on the floor, so the include_usage trailer always reported
// zeros when tools were enabled.
//
// These tests stub backend.ModelInferenceFunc so the worker exercises the
// real ComputeChoices → predFunc → LLMResponse pipeline. If a future change
// drops the TokenUsage somewhere along that path, the assertions on the
// returned value fail with a concrete count mismatch (e.g. 0 vs 213),
// not with a "function undefined" compile error.
var _ = Describe("streaming workers surface final TokenUsage (issue #9927)", func() {
var (
origInference modelInferenceFunc
appCfg *config.ApplicationConfig
)
BeforeEach(func() {
origInference = backend.ModelInferenceFunc
appCfg = config.NewApplicationConfig()
})
AfterEach(func() {
backend.ModelInferenceFunc = origInference
})
// mockBackendUsage installs a stub backend that yields one LLMResponse
// carrying the supplied TokenUsage. ComputeChoices' single-attempt path
// copies these counts into the value it returns to the worker.
mockBackendUsage := func(usage backend.TokenUsage, response string) {
backend.ModelInferenceFunc = func(
ctx context.Context, s string, messages schema.Messages,
images, videos, audios []string,
loader *model.ModelLoader, c *config.ModelConfig, cl *config.ModelConfigLoader,
o *config.ApplicationConfig,
tokenCallback func(string, backend.TokenUsage) bool,
tools, toolChoice string,
logprobs, topLogprobs *int,
logitBias map[string]float64,
metadata map[string]string,
) (func() (backend.LLMResponse, error), error) {
return func() (backend.LLMResponse, error) {
return backend.LLMResponse{
Response: response,
Usage: usage,
}, nil
}, nil
}
}
makeReq := func() *schema.OpenAIRequest {
ctx, cancel := context.WithCancel(context.Background())
req := &schema.OpenAIRequest{
Context: ctx,
Cancel: cancel,
}
req.Model = "test-model" // promoted from BasicModelRequest
return req
}
// drainResponses consumes everything the worker pushes onto the channel
// so the worker is never blocked on its send. The channel is unbuffered
// (matching production), so the drain goroutine must be running before
// the worker is called.
drainResponses := func(ch <-chan schema.OpenAIResponse) <-chan struct{} {
done := make(chan struct{})
go func() {
for range ch {
}
close(done)
}()
return done
}
Describe("processStream (no-tools path)", func() {
It("returns the cumulative TokenUsage produced by the backend", func() {
mockBackendUsage(backend.TokenUsage{Prompt: 18, Completion: 213}, "Hello there")
req := makeReq()
cfg := &config.ModelConfig{}
responses := make(chan schema.OpenAIResponse)
done := drainResponses(responses)
actual, err := processStream("prompt", req, cfg, nil, appCfg, nil, responses, "req-1", 0)
<-done
Expect(err).ToNot(HaveOccurred())
Expect(actual.Prompt).To(Equal(18),
"prompt tokens must round-trip from backend through processStream")
Expect(actual.Completion).To(Equal(213),
"completion tokens must round-trip from backend through processStream")
})
It("returns zero TokenUsage when the backend reports zero (negative control)", func() {
mockBackendUsage(backend.TokenUsage{}, "x")
req := makeReq()
cfg := &config.ModelConfig{}
responses := make(chan schema.OpenAIResponse)
done := drainResponses(responses)
actual, err := processStream("prompt", req, cfg, nil, appCfg, nil, responses, "req-1", 0)
<-done
Expect(err).ToNot(HaveOccurred())
Expect(actual.Prompt).To(BeZero())
Expect(actual.Completion).To(BeZero())
})
})
Describe("processStreamWithTools (tools path)", func() {
It("returns the cumulative TokenUsage produced by the backend", func() {
// This is the direct regression check for issue #9927: with tools
// enabled, the trailer was reporting {0,0,0} because the worker
// discarded ComputeChoices' second return value.
mockBackendUsage(backend.TokenUsage{Prompt: 18, Completion: 213}, "answer")
req := makeReq()
cfg := &config.ModelConfig{}
responses := make(chan schema.OpenAIResponse)
done := drainResponses(responses)
var textContent string
actual, err := processStreamWithTools("none", "prompt", req, cfg, nil, appCfg, nil, responses, "req-1", 0, &textContent)
<-done
Expect(err).ToNot(HaveOccurred())
Expect(actual.Prompt).To(Equal(18),
"prompt tokens must round-trip from backend through processStreamWithTools (issue #9927)")
Expect(actual.Completion).To(Equal(213),
"completion tokens must round-trip from backend through processStreamWithTools (issue #9927)")
})
It("forwards timing fields when the backend supplies them", func() {
mockBackendUsage(backend.TokenUsage{
Prompt: 10, Completion: 20,
TimingPromptProcessing: 0.5,
TimingTokenGeneration: 1.5,
}, "answer")
req := makeReq()
cfg := &config.ModelConfig{}
responses := make(chan schema.OpenAIResponse)
done := drainResponses(responses)
var textContent string
actual, err := processStreamWithTools("none", "prompt", req, cfg, nil, appCfg, nil, responses, "req-1", 0, &textContent)
<-done
Expect(err).ToNot(HaveOccurred())
Expect(actual.TimingPromptProcessing).To(Equal(0.5))
Expect(actual.TimingTokenGeneration).To(Equal(1.5))
})
})
})
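
The specs above rely on a package-level function variable as a test seam: save the original, install a stub, restore it afterwards. A minimal sketch of that technique follows; `inferenceFunc`, `predict` and `withStub` are hypothetical names, and the signature is deliberately much smaller than `backend.ModelInferenceFunc`.

```go
package main

// inferenceFunc is a package-level seam; production code calls it through
// predict, and tests may swap it out temporarily.
var inferenceFunc = func(prompt string) (string, int) {
	return "real:" + prompt, len(prompt)
}

// predict is the code under test; it only knows about the seam.
func predict(prompt string) (string, int) {
	return inferenceFunc(prompt)
}

// withStub installs stub for the duration of body and always restores the
// original afterwards, mirroring the BeforeEach/AfterEach pair in the specs.
func withStub(stub func(string) (string, int), body func()) {
	orig := inferenceFunc
	defer func() { inferenceFunc = orig }()
	inferenceFunc = stub
	body()
}
```

Restoring in a deferred call keeps a panicking spec from leaking the stub into later tests. The seam is still global state, so specs that use it should not run in parallel with each other.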


@@ -0,0 +1,390 @@
package openai
import (
"encoding/json"
"github.com/mudler/LocalAI/core/backend"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/schema"
"github.com/mudler/LocalAI/pkg/functions"
"github.com/mudler/LocalAI/pkg/model"
reason "github.com/mudler/LocalAI/pkg/reasoning"
"github.com/mudler/xlog"
)
// processStream is the streaming worker for chat completions with no
// tool/function calling involved. It pushes SSE-shaped chunks onto
// `responses` and returns the authoritative cumulative TokenUsage from
// the prediction so the caller can populate the include_usage trailer
// without having to peek inside the chunks.
//
// The caller owns the `responses` channel and is expected to read from
// it while this function runs; processStream closes the channel before
// returning.
func processStream(
s string,
req *schema.OpenAIRequest,
cfg *config.ModelConfig,
cl *config.ModelConfigLoader,
startupOptions *config.ApplicationConfig,
loader *model.ModelLoader,
responses chan schema.OpenAIResponse,
id string,
created int,
) (backend.TokenUsage, error) {
responses <- schema.OpenAIResponse{
ID: id,
Created: created,
Model: req.Model, // we have to return what the user sent here, due to OpenAI spec.
Choices: []schema.Choice{{Delta: &schema.Message{Role: "assistant"}, Index: 0, FinishReason: nil}},
Object: "chat.completion.chunk",
}
// Detect if thinking token is already in prompt or template
// When UseTokenizerTemplate is enabled, predInput is empty, so we check the template
var template string
if cfg.TemplateConfig.UseTokenizerTemplate {
template = cfg.GetModelTemplate()
} else {
template = s
}
thinkingStartToken := reason.DetectThinkingStartToken(template, &cfg.ReasoningConfig)
extractor := reason.NewReasoningExtractor(thinkingStartToken, cfg.ReasoningConfig)
_, finalUsage, _, err := ComputeChoices(req, s, cfg, cl, startupOptions, loader, func(s string, c *[]schema.Choice) {}, func(s string, tokenUsage backend.TokenUsage) bool {
var reasoningDelta, contentDelta string
// Always keep the Go-side extractor in sync with raw tokens so it
// can serve as fallback for backends without an autoparser (e.g. vLLM).
goReasoning, goContent := extractor.ProcessToken(s)
// When C++ autoparser chat deltas are available, prefer them: they
// handle model-specific formats (Gemma 4, etc.) without Go-side tags.
// Otherwise fall back to Go-side extraction.
if tokenUsage.HasChatDeltaContent() {
rawReasoning, cd := tokenUsage.ChatDeltaReasoningAndContent()
contentDelta = cd
reasoningDelta = extractor.ProcessChatDeltaReasoning(rawReasoning)
} else {
reasoningDelta = goReasoning
contentDelta = goContent
}
delta := &schema.Message{}
if contentDelta != "" {
delta.Content = &contentDelta
}
if reasoningDelta != "" {
delta.Reasoning = &reasoningDelta
}
responses <- schema.OpenAIResponse{
ID: id,
Created: created,
Model: req.Model, // we have to return what the user sent here, due to OpenAI spec.
Choices: []schema.Choice{{Delta: delta, Index: 0, FinishReason: nil}},
Object: "chat.completion.chunk",
}
return true
})
close(responses)
return finalUsage, err
}
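// processStreamWithTools is the streaming worker for chat completions
// with tools / function calling. Same contract as processStream: pushes
// chunks onto `responses`, closes the channel, returns the cumulative
// TokenUsage. Note that the early error returns below leave `responses`
// open, so the channel is only closed on the success path.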
//
// Returning the TokenUsage as a normal Go value (rather than smuggling
// it on a sentinel chunk) is the fix for issue #9927 — the previous
// implementation discarded the value from ComputeChoices, so the
// include_usage trailer reported zeros whenever `tools` was in play.
func processStreamWithTools(
noAction string,
prompt string,
req *schema.OpenAIRequest,
cfg *config.ModelConfig,
cl *config.ModelConfigLoader,
startupOptions *config.ApplicationConfig,
loader *model.ModelLoader,
responses chan schema.OpenAIResponse,
id string,
created int,
textContentToReturn *string,
) (backend.TokenUsage, error) {
// Detect if thinking token is already in prompt or template
var template string
if cfg.TemplateConfig.UseTokenizerTemplate {
template = cfg.GetModelTemplate()
} else {
template = prompt
}
thinkingStartToken := reason.DetectThinkingStartToken(template, &cfg.ReasoningConfig)
extractor := reason.NewReasoningExtractor(thinkingStartToken, cfg.ReasoningConfig)
result := ""
lastEmittedCount := 0
sentInitialRole := false
sentReasoning := false
hasChatDeltaToolCalls := false
hasChatDeltaContent := false
_, finalUsage, chatDeltas, err := ComputeChoices(req, prompt, cfg, cl, startupOptions, loader, func(s string, c *[]schema.Choice) {}, func(s string, usage backend.TokenUsage) bool {
result += s
// Track whether ChatDeltas from the C++ autoparser contain
// tool calls or content, so the retry decision can account for them.
for _, d := range usage.ChatDeltas {
if len(d.ToolCalls) > 0 {
hasChatDeltaToolCalls = true
}
if d.Content != "" {
hasChatDeltaContent = true
}
}
var reasoningDelta, contentDelta string
goReasoning, goContent := extractor.ProcessToken(s)
if usage.HasChatDeltaContent() {
rawReasoning, cd := usage.ChatDeltaReasoningAndContent()
contentDelta = cd
reasoningDelta = extractor.ProcessChatDeltaReasoning(rawReasoning)
} else {
reasoningDelta = goReasoning
contentDelta = goContent
}
// Emit reasoning deltas in their own SSE chunks before any tool-call chunks
// (OpenAI spec: reasoning and tool_calls never share a delta)
if reasoningDelta != "" {
responses <- schema.OpenAIResponse{
ID: id,
Created: created,
Model: req.Model,
Choices: []schema.Choice{{
Delta: &schema.Message{Reasoning: &reasoningDelta},
Index: 0,
}},
Object: "chat.completion.chunk",
}
sentReasoning = true
}
// Stream content deltas (cleaned of reasoning tags) while no tool calls
// have been detected. Once the incremental parser finds tool calls,
// content stops: per OpenAI spec, content and tool_calls don't mix.
if lastEmittedCount == 0 && contentDelta != "" {
if !sentInitialRole {
responses <- schema.OpenAIResponse{
ID: id, Created: created, Model: req.Model,
Choices: []schema.Choice{{Delta: &schema.Message{Role: "assistant"}, Index: 0}},
Object: "chat.completion.chunk",
}
sentInitialRole = true
}
responses <- schema.OpenAIResponse{
ID: id, Created: created, Model: req.Model,
Choices: []schema.Choice{{
Delta: &schema.Message{Content: &contentDelta},
Index: 0,
}},
Object: "chat.completion.chunk",
}
}
// Try incremental XML parsing for streaming support using iterative parser
// This allows emitting partial tool calls as they're being generated
cleanedResult := functions.CleanupLLMResult(result, cfg.FunctionsConfig)
// Determine XML format from config
var xmlFormat *functions.XMLToolCallFormat
if cfg.FunctionsConfig.XMLFormat != nil {
xmlFormat = cfg.FunctionsConfig.XMLFormat
} else if cfg.FunctionsConfig.XMLFormatPreset != "" {
xmlFormat = functions.GetXMLFormatPreset(cfg.FunctionsConfig.XMLFormatPreset)
}
// Use iterative parser for streaming (partial parsing enabled)
// Try XML parsing first
partialResults, parseErr := functions.ParseXMLIterative(cleanedResult, xmlFormat, true)
if parseErr == nil && len(partialResults) > 0 {
// Emit new XML tool calls that weren't emitted before
if len(partialResults) > lastEmittedCount {
for i := lastEmittedCount; i < len(partialResults); i++ {
toolCall := partialResults[i]
initialMessage := schema.OpenAIResponse{
ID: id,
Created: created,
Model: req.Model,
Choices: []schema.Choice{{
Delta: &schema.Message{
Role: "assistant",
ToolCalls: []schema.ToolCall{
{
Index: i,
ID: id,
Type: "function",
FunctionCall: schema.FunctionCall{
Name: toolCall.Name,
},
},
},
},
Index: 0,
FinishReason: nil,
}},
Object: "chat.completion.chunk",
}
select {
case responses <- initialMessage:
default:
}
}
lastEmittedCount = len(partialResults)
}
} else {
// Try JSON tool call parsing for streaming.
// Only emit NEW tool calls (same guard as XML parser above).
jsonResults, jsonErr := functions.ParseJSONIterative(cleanedResult, true)
if jsonErr == nil && len(jsonResults) > lastEmittedCount {
for i := lastEmittedCount; i < len(jsonResults); i++ {
jsonObj := jsonResults[i]
name, ok := jsonObj["name"].(string)
if !ok || name == "" {
continue
}
args := "{}"
if argsVal, ok := jsonObj["arguments"]; ok {
if argsStr, ok := argsVal.(string); ok {
args = argsStr
} else {
argsBytes, _ := json.Marshal(argsVal)
args = string(argsBytes)
}
}
initialMessage := schema.OpenAIResponse{
ID: id,
Created: created,
Model: req.Model,
Choices: []schema.Choice{{
Delta: &schema.Message{
Role: "assistant",
ToolCalls: []schema.ToolCall{
{
Index: i,
ID: id,
Type: "function",
FunctionCall: schema.FunctionCall{
Name: name,
Arguments: args,
},
},
},
},
Index: 0,
FinishReason: nil,
}},
Object: "chat.completion.chunk",
}
responses <- initialMessage
}
lastEmittedCount = len(jsonResults)
}
}
return true
},
func(attempt int) bool {
// After streaming completes: check if we got actionable content
cleaned := extractor.CleanedContent()
// Check for tool calls from chat deltas (will be re-checked after ComputeChoices,
// but we need to know here whether to retry).
// Also check ChatDelta flags: when the C++ autoparser is active,
// tool calls and content are delivered via ChatDeltas while the
// raw message is cleared. Without this check, we'd retry
// unnecessarily, losing valid results and concatenating output.
hasToolCalls := lastEmittedCount > 0 || hasChatDeltaToolCalls
hasContent := cleaned != "" || hasChatDeltaContent
if !hasContent && !hasToolCalls {
xlog.Warn("Streaming: backend produced only reasoning, retrying",
"reasoning_len", len(extractor.Reasoning()), "attempt", attempt+1)
extractor.ResetAndSuppressReasoning()
result = ""
lastEmittedCount = 0
sentInitialRole = false
hasChatDeltaToolCalls = false
hasChatDeltaContent = false
return true
}
return false
},
)
if err != nil {
return finalUsage, err
}
// Try using pre-parsed tool calls from C++ autoparser (chat deltas)
var functionResults []functions.FuncCallResults
var reasoning string
if deltaToolCalls := functions.ToolCallsFromChatDeltas(chatDeltas); len(deltaToolCalls) > 0 {
xlog.Debug("[ChatDeltas] Using pre-parsed tool calls from C++ autoparser", "count", len(deltaToolCalls))
functionResults = deltaToolCalls
// Use content/reasoning from deltas too
*textContentToReturn = functions.ContentFromChatDeltas(chatDeltas)
reasoning = functions.ReasoningFromChatDeltas(chatDeltas)
} else {
// Fallback: parse tool calls from raw text (no chat deltas from backend)
xlog.Debug("[ChatDeltas] no pre-parsed tool calls, falling back to Go-side text parsing")
reasoning = extractor.Reasoning()
cleanedResult := extractor.CleanedContent()
*textContentToReturn = functions.ParseTextContent(cleanedResult, cfg.FunctionsConfig)
cleanedResult = functions.CleanupLLMResult(cleanedResult, cfg.FunctionsConfig)
functionResults = functions.ParseFunctionCall(cleanedResult, cfg.FunctionsConfig)
}
xlog.Debug("[ChatDeltas] final tool call decision", "tool_calls", len(functionResults), "text_content", *textContentToReturn)
// noAction is a sentinel "just answer" pseudo-function: not a real
// tool call. Scan the whole slice rather than only index 0 so we
// don't drop a real tool call that happens to follow a noAction
// entry, and so the default branch isn't entered with only noAction
// entries to emit as tool_calls.
noActionToRun := !hasRealCall(functionResults, noAction)
switch {
case noActionToRun:
// The final usage trailer (when the caller opted in with
// stream_options.include_usage) is built by the outer streaming
// loop from the TokenUsage this function returns, not from any
// chunk on the responses channel.
var result string
if !sentInitialRole {
var hqErr error
result, hqErr = handleQuestion(cfg, functionResults, extractor.CleanedContent(), prompt)
if hqErr != nil {
xlog.Error("error handling question", "error", hqErr)
return finalUsage, hqErr
}
}
for _, chunk := range buildNoActionFinalChunks(
id, req.Model, created,
sentInitialRole, sentReasoning,
result, reasoning,
) {
responses <- chunk
}
default:
for _, chunk := range buildDeferredToolCallChunks(
id, req.Model, created,
functionResults, lastEmittedCount,
sentInitialRole, *textContentToReturn,
sentReasoning, reasoning,
) {
responses <- chunk
}
}
close(responses)
return finalUsage, err
}
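
The contract described in the doc comments above (push chunks onto a caller-owned channel, close it, and return the cumulative usage as an ordinary return value) can be sketched in isolation. This is a minimal illustration, not LocalAI's code: `Usage`, `Chunk`, `streamWorker`, and `collect` are hypothetical names, and the sketch uses `defer close(out)` so the channel is closed on every return path, which is a choice of the sketch rather than something the diff does.

```go
package main

import "fmt"

// Usage and Chunk are hypothetical stand-ins for backend.TokenUsage and
// schema.OpenAIResponse; they are not LocalAI types.
type Usage struct{ Prompt, Completion int }

type Chunk struct{ Content string }

// streamWorker pushes one chunk per token, closes the channel on every
// return path, and hands the cumulative usage back as a return value
// instead of encoding it in a sentinel chunk.
func streamWorker(tokens []string, out chan<- Chunk) (Usage, error) {
	defer close(out)
	usage := Usage{Prompt: 3} // assumed prompt size, for illustration only
	for _, t := range tokens {
		out <- Chunk{Content: t}
		usage.Completion++
	}
	return usage, nil
}

// collect drains the channel in a goroutine while the worker runs, the
// way a caller that owns the channel would.
func collect(tokens []string) (string, Usage, error) {
	out := make(chan Chunk)
	text := make(chan string)
	go func() {
		s := ""
		for c := range out {
			s += c.Content
		}
		text <- s
	}()
	usage, err := streamWorker(tokens, out)
	return <-text, usage, err
}

func main() {
	text, usage, err := collect([]string{"he", "llo"})
	fmt.Println(text, usage.Prompt, usage.Completion, err)
}
```

Because the usage travels as a return value, the caller can build a usage trailer without inspecting any chunk.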


@@ -17,16 +17,20 @@ import (
)
type APIExchangeRequest struct {
-Method string `json:"method"`
-Path string `json:"path"`
-Headers *http.Header `json:"headers"`
-Body *[]byte `json:"body"`
+Method string `json:"method"`
+Path string `json:"path"`
+Headers *http.Header `json:"headers"`
+Body *[]byte `json:"body"`
+BodyTruncated bool `json:"body_truncated,omitempty"`
+BodyBytes int `json:"body_bytes,omitempty"` // original size before truncation
}
type APIExchangeResponse struct {
-Status int `json:"status"`
-Headers *http.Header `json:"headers"`
-Body *[]byte `json:"body"`
+Status int `json:"status"`
+Headers *http.Header `json:"headers"`
+Body *[]byte `json:"body"`
+BodyTruncated bool `json:"body_truncated,omitempty"`
+BodyBytes int `json:"body_bytes,omitempty"` // original size before truncation
}
type APIExchange struct {
@@ -66,11 +70,29 @@ var doInitializeTracing = sync.OnceFunc(func() {
type bodyWriter struct {
http.ResponseWriter
-body *bytes.Buffer
+body *bytes.Buffer
+maxBytes int // 0 = unlimited capture
+truncated bool
+totalBytes int // bytes the upstream handler wrote, even past the cap
}
func (w *bodyWriter) Write(b []byte) (int, error) {
-w.body.Write(b)
+// Capture into the trace buffer up to maxBytes, then drop the overflow
+// so a chatty endpoint can't grow the buffer without bound. The full
+// payload still flows through to the real client below.
+w.totalBytes += len(b)
+if w.maxBytes <= 0 {
+w.body.Write(b)
+} else if remain := w.maxBytes - w.body.Len(); remain > 0 {
+if remain >= len(b) {
+w.body.Write(b)
+} else {
+w.body.Write(b[:remain])
+w.truncated = true
+}
+} else {
+w.truncated = true
+}
return w.ResponseWriter.Write(b)
}
@@ -80,6 +102,20 @@ func (w *bodyWriter) Flush() {
}
}
// truncateForTrace returns a defensive copy of body capped at maxBytes,
// and a flag indicating whether the cap forced truncation. maxBytes <= 0
// disables the cap.
func truncateForTrace(body []byte, maxBytes int) ([]byte, bool) {
if maxBytes <= 0 || len(body) <= maxBytes {
out := make([]byte, len(body))
copy(out, body)
return out, false
}
out := make([]byte, maxBytes)
copy(out, body[:maxBytes])
return out, true
}
func initializeTracing(maxItems int) {
tracingMaxItems = maxItems
doInitializeTracing()
@@ -134,11 +170,18 @@ func TraceMiddleware(app *application.Application) echo.MiddlewareFunc {
startTime := time.Now()
// Cap captured payload size. Without this, /embeddings and
// streaming /chat/completions blow the in-memory buffer into the
// tens of MB, which then locks the admin Traces UI fetching the
// JSON dump faster than the 5s auto-refresh.
maxBodyBytes := app.ApplicationConfig().TracingMaxBodyBytes
// Wrap response writer to capture body
resBody := new(bytes.Buffer)
mw := &bodyWriter{
ResponseWriter: c.Response().Writer,
body: resBody,
maxBytes: maxBodyBytes,
}
c.Response().Writer = mw
@@ -159,8 +202,7 @@ func TraceMiddleware(app *application.Application) echo.MiddlewareFunc {
// via any heap-dump-style introspection, and tokens shouldn't
// outlive the request that carried them.
requestHeaders := redactSensitiveHeaders(c.Request().Header)
-requestBody := make([]byte, len(body))
-copy(requestBody, body)
+requestBody, requestTruncated := truncateForTrace(body, maxBodyBytes)
responseHeaders := redactSensitiveHeaders(c.Response().Header())
responseBody := make([]byte, resBody.Len())
copy(responseBody, resBody.Bytes())
@@ -168,15 +210,19 @@ func TraceMiddleware(app *application.Application) echo.MiddlewareFunc {
Timestamp: startTime,
Duration: time.Since(startTime),
Request: APIExchangeRequest{
-Method: c.Request().Method,
-Path: c.Path(),
-Headers: &requestHeaders,
-Body: &requestBody,
+Method: c.Request().Method,
+Path: c.Path(),
+Headers: &requestHeaders,
+Body: &requestBody,
+BodyTruncated: requestTruncated,
+BodyBytes: len(body),
},
Response: APIExchangeResponse{
-Status: status,
-Headers: &responseHeaders,
-Body: &responseBody,
+Status: status,
+Headers: &responseHeaders,
+Body: &responseBody,
+BodyTruncated: mw.truncated,
+BodyBytes: mw.totalBytes,
},
}
if handlerErr != nil {


@@ -0,0 +1,116 @@
package middleware
import (
"bytes"
"net/http/httptest"
"strings"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
// The trace middleware copies request and response bodies into an in-memory
// buffer that backs the admin /api/traces endpoint. With no upper bound a
// chatty workload (embeddings, large completions) trivially produces a
// multi-MB response that locks the Traces UI in a loading state — fetching
// and parsing the payload outruns the 5-second auto-refresh. These specs
// pin the capping contract so future refactors keep both the cap and the
// passthrough to the real client intact.
var _ = Describe("bodyWriter capping", func() {
It("captures the full body when maxBytes is 0 (unlimited)", func() {
downstream := httptest.NewRecorder()
buf := &bytes.Buffer{}
bw := &bodyWriter{ResponseWriter: downstream, body: buf, maxBytes: 0}
payload := []byte(strings.Repeat("x", 4096))
n, err := bw.Write(payload)
Expect(err).ToNot(HaveOccurred())
Expect(n).To(Equal(len(payload)))
Expect(buf.Len()).To(Equal(len(payload)))
Expect(downstream.Body.Len()).To(Equal(len(payload)))
Expect(bw.truncated).To(BeFalse())
})
It("stops appending to the trace buffer once maxBytes is reached but still forwards to the client", func() {
downstream := httptest.NewRecorder()
buf := &bytes.Buffer{}
bw := &bodyWriter{ResponseWriter: downstream, body: buf, maxBytes: 100}
payload := []byte(strings.Repeat("a", 250))
n, err := bw.Write(payload)
Expect(err).ToNot(HaveOccurred())
Expect(n).To(Equal(len(payload)), "Write must return the full byte count so callers see no short write")
Expect(buf.Len()).To(Equal(100), "trace buffer should hold exactly maxBytes")
Expect(downstream.Body.Len()).To(Equal(len(payload)), "client must still receive every byte")
Expect(bw.truncated).To(BeTrue())
})
It("handles a write that straddles the cap by keeping only the leading slice", func() {
downstream := httptest.NewRecorder()
buf := &bytes.Buffer{}
bw := &bodyWriter{ResponseWriter: downstream, body: buf, maxBytes: 10}
_, err := bw.Write([]byte("12345"))
Expect(err).ToNot(HaveOccurred())
Expect(bw.truncated).To(BeFalse())
_, err = bw.Write([]byte("67890ABCDE"))
Expect(err).ToNot(HaveOccurred())
Expect(buf.String()).To(Equal("1234567890"))
Expect(downstream.Body.String()).To(Equal("1234567890ABCDE"))
Expect(bw.truncated).To(BeTrue())
})
It("ignores further writes after the cap was already hit", func() {
downstream := httptest.NewRecorder()
buf := &bytes.Buffer{}
bw := &bodyWriter{ResponseWriter: downstream, body: buf, maxBytes: 4}
_, _ = bw.Write([]byte("AAAA"))
_, _ = bw.Write([]byte("BBBB"))
_, _ = bw.Write([]byte("CCCC"))
Expect(buf.String()).To(Equal("AAAA"))
Expect(downstream.Body.String()).To(Equal("AAAABBBBCCCC"))
Expect(bw.truncated).To(BeTrue())
})
})
var _ = Describe("truncateForTrace", func() {
It("returns the input unchanged when below the cap", func() {
in := []byte("hello")
out, truncated := truncateForTrace(in, 1024)
Expect(truncated).To(BeFalse())
Expect(out).To(Equal(in))
})
It("truncates when the input exceeds the cap and signals truncation", func() {
in := []byte(strings.Repeat("z", 200))
out, truncated := truncateForTrace(in, 64)
Expect(truncated).To(BeTrue())
Expect(out).To(HaveLen(64))
Expect(string(out)).To(Equal(strings.Repeat("z", 64)))
})
It("treats maxBytes <= 0 as unlimited (back-compat with current default)", func() {
in := []byte(strings.Repeat("q", 10_000))
out, truncated := truncateForTrace(in, 0)
Expect(truncated).To(BeFalse())
Expect(out).To(HaveLen(len(in)))
})
It("does not retain the caller's backing array (defensive copy)", func() {
in := []byte("abcdefghij")
out, truncated := truncateForTrace(in, 4)
Expect(truncated).To(BeTrue())
Expect(string(out)).To(Equal("abcd"))
// Mutating the source must not corrupt the trace copy.
in[0] = 'Z'
Expect(string(out)).To(Equal("abcd"))
})
})


@@ -4,6 +4,7 @@ import (
"bytes"
"encoding/json"
"sync"
"sync/atomic"
"time"
"github.com/labstack/echo/v4"
@@ -14,18 +15,37 @@ import (
const (
usageFlushInterval = 5 * time.Second
-usageMaxPending = 5000
+// usageMaxPending bounds the in-memory queue. Sized for bursty inference
+// traffic on a self-hosted instance with a slow or unavailable DB.
+usageMaxPending = 50000
)
// usageBatcher accumulates usage records and flushes them to the DB periodically.
type usageBatcher struct {
-mu sync.Mutex
-pending []*auth.UsageRecord
-db *gorm.DB
+mu sync.Mutex
+pending []*auth.UsageRecord
+db *gorm.DB
+stop chan struct{}
+done chan struct{}
+stopOnce sync.Once
}
// droppedRecords counts records discarded because the in-memory queue was full.
// Used to rate-limit the warn log so a sustained outage doesn't flood it.
var droppedRecords atomic.Uint64
func (b *usageBatcher) add(r *auth.UsageRecord) {
b.mu.Lock()
if len(b.pending) >= usageMaxPending {
b.mu.Unlock()
// Rate-limit: one warn per 1024 drops keeps the log readable.
n := droppedRecords.Add(1)
if n&1023 == 1 {
xlog.Warn("usage batcher full, dropping record",
"cap", usageMaxPending, "total_dropped", n)
}
return
}
b.pending = append(b.pending, r)
b.mu.Unlock()
}
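
The rate limit in `add` relies on a bitmask: `n&1023` is `n` modulo 1024, and in Go `&` binds tighter than `==`, so `n&1023 == 1` is true for drop number 1, 1025, 2049, and so on. The sketch below isolates that check; `shouldLog` and `loggedDrops` are hypothetical helpers written for illustration and do not appear in the diff.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// shouldLog increments the drop counter and reports whether this drop is
// the first of a block of 1024, i.e. whether a warning should be emitted.
func shouldLog(counter *atomic.Uint64) (uint64, bool) {
	n := counter.Add(1)
	return n, n&1023 == 1
}

// loggedDrops simulates the given number of drops and returns the drop
// numbers that would have produced a log line.
func loggedDrops(drops int) []uint64 {
	var counter atomic.Uint64
	var logged []uint64
	for i := 0; i < drops; i++ {
		if n, ok := shouldLog(&counter); ok {
			logged = append(logged, n)
		}
	}
	return logged
}

func main() {
	fmt.Println(loggedDrops(3000)) // [1 1025 2049]
}
```

Comparing against 1 rather than 0 means the very first drop is logged immediately, instead of only after 1024 records have been lost.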
@@ -42,31 +62,102 @@ func (b *usageBatcher) flush() {
if err := b.db.Create(&batch).Error; err != nil {
xlog.Error("Failed to flush usage batch", "count", len(batch), "error", err)
-// Re-queue failed records with a cap to avoid unbounded growth
+// Cap-aware re-queue: prepend as much of the failed batch as fits
+// alongside any records added concurrently with the failed write.
b.mu.Lock()
-if len(b.pending) < usageMaxPending {
-b.pending = append(batch, b.pending...)
+room := usageMaxPending - len(b.pending)
+if room > 0 {
+if room > len(batch) {
+room = len(batch)
+}
+b.pending = append(batch[:room], b.pending...)
}
b.mu.Unlock()
}
}
-var batcher *usageBatcher
+func (b *usageBatcher) run() {
+defer close(b.done)
+ticker := time.NewTicker(usageFlushInterval)
+defer ticker.Stop()
+for {
+select {
+case <-ticker.C:
+b.flush()
+case <-b.stop:
+b.flush() // final drain
+return
+}
+}
+}
+func (b *usageBatcher) shutdown() {
+b.stopOnce.Do(func() {
+close(b.stop)
+<-b.done
+})
+}
+// The package-level batcher is guarded by batcherMu so Init / Shutdown cycles
+// (the test pattern) don't race against UsageMiddleware reads.
+var (
+batcherMu sync.RWMutex
+batcher *usageBatcher
+)
+func currentBatcher() *usageBatcher {
+batcherMu.RLock()
+defer batcherMu.RUnlock()
+return batcher
+}
// InitUsageRecorder starts a background goroutine that periodically flushes
-// accumulated usage records to the database.
+// accumulated usage records to the database. Calling it more than once
+// shuts down the previous batcher first so its goroutine doesn't leak.
func InitUsageRecorder(db *gorm.DB) {
if db == nil {
return
}
-batcher = &usageBatcher{db: db}
-go func() {
-ticker := time.NewTicker(usageFlushInterval)
-defer ticker.Stop()
-for range ticker.C {
-batcher.flush()
-}
-}()
+batcherMu.Lock()
+old := batcher
+batcher = nil
+batcherMu.Unlock()
+if old != nil {
+old.shutdown()
+}
+b := &usageBatcher{
+db: db,
+stop: make(chan struct{}),
+done: make(chan struct{}),
+}
+batcherMu.Lock()
+batcher = b
+batcherMu.Unlock()
+go b.run()
+}
+// ShutdownUsageRecorder stops the background flusher and synchronously drains
+// pending records once. Safe to call multiple times. Not yet wired into the
+// application lifecycle; intended for graceful process exit and tests.
+func ShutdownUsageRecorder() {
+batcherMu.Lock()
+b := batcher
+batcher = nil
+batcherMu.Unlock()
+if b != nil {
+b.shutdown()
+}
+}
+// FlushNow synchronously flushes any pending usage records. Intended for tests
+// that need deterministic behaviour without waiting for the ticker.
+func FlushNow() {
+if b := currentBatcher(); b != nil {
+b.flush()
+}
+}
// usageResponseBody is the minimal structure we need from the response JSON.
@@ -84,7 +175,8 @@ type usageResponseBody struct {
func UsageMiddleware(db *gorm.DB) echo.MiddlewareFunc {
return func(next echo.HandlerFunc) echo.HandlerFunc {
return func(c echo.Context) error {
-if db == nil || batcher == nil {
+b := currentBatcher()
+if db == nil || b == nil {
return next(c)
}
@@ -149,9 +241,17 @@ func UsageMiddleware(db *gorm.DB) echo.MiddlewareFunc {
return handlerErr
}
source := auth.GetSource(c)
if source == "" {
// Auth disabled or unrecognised path: classify as web so the row is still
// bucketable rather than silently dropped from per-source aggregates.
source = auth.UsageSourceWeb
}
record := &auth.UsageRecord{
UserID: user.ID,
UserName: user.Name,
Source: source,
Model: resp.Model,
Endpoint: c.Request().URL.Path,
PromptTokens: resp.Usage.PromptTokens,
@@ -161,7 +261,13 @@ func UsageMiddleware(db *gorm.DB) echo.MiddlewareFunc {
CreatedAt: startTime,
}
-batcher.add(record)
+if key := auth.GetAPIKey(c); key != nil {
+id := key.ID
+record.APIKeyID = &id
+record.APIKeyName = key.Name
+}
+b.add(record)
return handlerErr
}
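
The cap-aware re-queue in `flush` (put back only as many failed records as still fit, ahead of anything queued in the meantime) can be written as a small pure function. This is a sketch, not the code from the diff: `requeue` is a hypothetical generic helper, and unlike the original `append(batch[:room], b.pending...)` it copies into a fresh slice so neither input is modified.

```go
package main

import "fmt"

// requeue returns failed[:room] followed by pending, where room is the
// space left under limit. If pending already fills the limit, the failed
// records are dropped and pending is returned unchanged.
func requeue[T any](failed, pending []T, limit int) []T {
	room := limit - len(pending)
	if room <= 0 {
		return pending
	}
	if room > len(failed) {
		room = len(failed)
	}
	out := make([]T, 0, room+len(pending))
	out = append(out, failed[:room]...)
	return append(out, pending...)
}

func main() {
	// Limit 3, one record queued during the failed write: only the two
	// oldest failed records fit back in front of it.
	fmt.Println(requeue([]int{1, 2, 3}, []int{9}, 3)) // [1 2 9]
}
```

Prepending keeps the retried records ahead of newer ones, so the next successful flush writes them in roughly their original order.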


@@ -0,0 +1,140 @@
//go:build auth
package middleware_test
import (
"bytes"
"encoding/json"
"net/http"
"net/http/httptest"
"github.com/labstack/echo/v4"
"github.com/mudler/LocalAI/core/http/auth"
"github.com/mudler/LocalAI/core/http/middleware"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"gorm.io/gorm"
)
// testAuthDB returns a fresh in-memory SQLite auth DB.
func testAuthDB() *gorm.DB {
db, err := auth.InitDB(":memory:")
if err != nil {
panic(err)
}
return db
}
var _ = Describe("UsageMiddleware", func() {
var (
e *echo.Echo
db *gorm.DB
)
BeforeEach(func() {
db = testAuthDB()
e = echo.New()
middleware.InitUsageRecorder(db)
})
AfterEach(func() {
middleware.ShutdownUsageRecorder()
})
okHandler := func(c echo.Context) error {
body, _ := json.Marshal(map[string]any{
"model": "gpt-4",
"usage": map[string]int{
"prompt_tokens": 10, "completion_tokens": 5, "total_tokens": 15,
},
})
c.Response().Header().Set("Content-Type", "application/json")
c.Response().WriteHeader(http.StatusOK)
_, _ = c.Response().Write(body)
return nil
}
// FlushNow drains pending records synchronously, replacing the 6s sleep
// that was previously needed to wait for the batcher's ticker.
flush := middleware.FlushNow
It("records source=web when auth_source is web", func() {
e.POST("/v1/chat/completions", okHandler, func(next echo.HandlerFunc) echo.HandlerFunc {
return func(c echo.Context) error {
c.Set("auth_user", &auth.User{ID: "alice", Name: "Alice"})
c.Set("auth_source", auth.UsageSourceWeb)
return next(c)
}
}, middleware.UsageMiddleware(db))
req := httptest.NewRequest("POST", "/v1/chat/completions", bytes.NewReader([]byte(`{}`)))
e.ServeHTTP(httptest.NewRecorder(), req)
flush()
var rec auth.UsageRecord
Expect(db.Where("user_id = ?", "alice").First(&rec).Error).To(Succeed())
Expect(rec.Source).To(Equal(auth.UsageSourceWeb))
Expect(rec.APIKeyID).To(BeNil())
Expect(rec.APIKeyName).To(BeEmpty())
})
It("records source=apikey with snapshotted name when auth_apikey is set", func() {
e.POST("/v1/chat/completions", okHandler, func(next echo.HandlerFunc) echo.HandlerFunc {
return func(c echo.Context) error {
c.Set("auth_user", &auth.User{ID: "alice", Name: "Alice"})
c.Set("auth_source", auth.UsageSourceAPIKey)
c.Set("auth_apikey", &auth.UserAPIKey{ID: "key-1", Name: "ci-runner"})
return next(c)
}
}, middleware.UsageMiddleware(db))
req := httptest.NewRequest("POST", "/v1/chat/completions", bytes.NewReader([]byte(`{}`)))
e.ServeHTTP(httptest.NewRecorder(), req)
flush()
var rec auth.UsageRecord
Expect(db.Where("user_id = ?", "alice").First(&rec).Error).To(Succeed())
Expect(rec.Source).To(Equal(auth.UsageSourceAPIKey))
Expect(rec.APIKeyID).ToNot(BeNil())
Expect(*rec.APIKeyID).To(Equal("key-1"))
Expect(rec.APIKeyName).To(Equal("ci-runner"))
})
It("FlushNow drains pending records synchronously", func() {
e.POST("/v1/chat/completions", okHandler, func(next echo.HandlerFunc) echo.HandlerFunc {
return func(c echo.Context) error {
c.Set("auth_user", &auth.User{ID: "carol", Name: "Carol"})
c.Set("auth_source", auth.UsageSourceWeb)
return next(c)
}
}, middleware.UsageMiddleware(db))
req := httptest.NewRequest("POST", "/v1/chat/completions", bytes.NewReader([]byte(`{}`)))
e.ServeHTTP(httptest.NewRecorder(), req)
// No sleep: FlushNow should drain immediately.
middleware.FlushNow()
var rec auth.UsageRecord
Expect(db.Where("user_id = ?", "carol").First(&rec).Error).To(Succeed())
Expect(rec.Source).To(Equal(auth.UsageSourceWeb))
})
It("falls back to source=web when auth_source is empty", func() {
e.POST("/v1/chat/completions", okHandler, func(next echo.HandlerFunc) echo.HandlerFunc {
return func(c echo.Context) error {
c.Set("auth_user", &auth.User{ID: "alice", Name: "Alice"})
// no auth_source set
return next(c)
}
}, middleware.UsageMiddleware(db))
req := httptest.NewRequest("POST", "/v1/chat/completions", bytes.NewReader([]byte(`{}`)))
e.ServeHTTP(httptest.NewRecorder(), req)
flush()
var rec auth.UsageRecord
Expect(db.Where("user_id = ?", "alice").First(&rec).Error).To(Succeed())
Expect(rec.Source).To(Equal(auth.UsageSourceWeb))
})
})
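
The `BeforeEach` / `AfterEach` pair above depends on the batcher's stop/done handshake: closing `stop` asks the goroutine to exit, the goroutine performs one last flush, and closing `done` tells the caller that the drain has finished. Below is a minimal sketch of that pattern with an in-memory sink instead of a database; `flusher` and its methods are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// flusher is a hypothetical background worker that moves pending items
// to flushed on a timer and once more on shutdown.
type flusher struct {
	mu       sync.Mutex
	pending  []int
	flushed  []int
	stop     chan struct{}
	done     chan struct{}
	stopOnce sync.Once
}

func newFlusher(interval time.Duration) *flusher {
	f := &flusher{stop: make(chan struct{}), done: make(chan struct{})}
	go f.run(interval)
	return f
}

func (f *flusher) add(v int) {
	f.mu.Lock()
	f.pending = append(f.pending, v)
	f.mu.Unlock()
}

func (f *flusher) flush() {
	f.mu.Lock()
	f.flushed = append(f.flushed, f.pending...)
	f.pending = nil
	f.mu.Unlock()
}

func (f *flusher) run(interval time.Duration) {
	defer close(f.done) // signals that the final drain has completed
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			f.flush()
		case <-f.stop:
			f.flush() // final drain
			return
		}
	}
}

// shutdown is safe to call repeatedly: sync.Once guards the close, and
// the wait on done makes the drain synchronous for the first caller.
func (f *flusher) shutdown() {
	f.stopOnce.Do(func() {
		close(f.stop)
		<-f.done
	})
}

func main() {
	f := newFlusher(time.Hour) // long interval: only shutdown flushes
	f.add(1)
	f.add(2)
	f.shutdown()
	f.shutdown() // second call is a no-op
	fmt.Println(f.flushed)
}
```

Waiting on `done` inside `shutdown` is what makes it safe to read the flushed data, or to close the database, right after the call returns.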


@@ -52,11 +52,22 @@ test.describe('Traces Settings', () => {
await page.locator('button', { hasText: 'Tracing is' }).click()
await expect(page.locator('text=Enable Tracing')).toBeVisible()
-const maxItemsInput = page.locator('input[type="number"]')
+// The Tracing panel has two numeric inputs (Max Items and Max Body Bytes).
+// Disambiguate by placeholder so adding a third field later doesn't break this.
+const maxItemsInput = page.getByPlaceholder('100')
await maxItemsInput.fill('500')
await expect(maxItemsInput).toHaveValue('500')
})
test('set max body bytes value', async ({ page }) => {
await page.locator('button', { hasText: 'Tracing is' }).click()
await expect(page.locator('text=Enable Tracing')).toBeVisible()
const maxBodyBytesInput = page.getByPlaceholder('65536')
await maxBodyBytesInput.fill('16384')
await expect(maxBodyBytesInput).toHaveValue('16384')
})
test('save shows toast', async ({ page }) => {
// Expand settings
await page.locator('button', { hasText: 'Tracing is' }).click()


@@ -53,7 +53,30 @@
},
"usage": {
"title": "Usage",
-"subtitle": "API token usage statistics"
+"subtitle": "API token usage statistics",
+"sources": {
+"tab": "Sources",
+"mixTitle": "Source mix",
+"ribbonAria": "{{apikey}}% API keys, {{web}}% Web UI, {{legacy}}% Legacy",
+"topSources": "Top sources over time",
+"searchPlaceholder": "Search by name or prefix",
+"sortBy": "Sort",
+"sortTokens": "Tokens",
+"sortRequests": "Requests",
+"sortLastUsed": "Last used",
+"sortName": "Name",
+"sortUser": "User",
+"webUI": "Web UI",
+"legacy": "Legacy",
+"revoked": "revoked",
+"filteredTo": "Filtered to: {{name}}",
+"clearFilter": "Clear filter",
+"other": "Other ({{count}})",
+"noTrafficShort": "No requests in this period.",
+"noKeysYet": "Once requests come in, you'll see them broken down here.",
+"createKey": "Create your first API key",
+"truncatedWarning": "Showing top 200 keys. Apply a filter to narrow further."
+}
},
"explorer": {
"title": "Explorer",


@@ -649,6 +649,7 @@
align-items: center;
gap: var(--spacing-md);
padding: var(--spacing-xs) 0;
flex-wrap: wrap;
}
.operation-info {
@@ -739,6 +740,110 @@
color: var(--color-error);
}
/* Operations bar: per-node breakdown (multi-worker installs) */
.operation-expand {
background: none;
border: none;
color: var(--color-text-muted);
cursor: pointer;
padding: 0 var(--spacing-xs);
font-size: var(--text-xs);
display: inline-flex;
align-items: center;
gap: 0.25rem;
}
.operation-expand:hover {
color: var(--color-text-primary);
}
.operation-expand-label {
font-size: var(--text-xs);
}
.operation-nodes-list {
list-style: none;
margin: var(--spacing-xs) 0 0;
padding: var(--spacing-xs) 0 0;
border-top: 1px solid var(--color-border-subtle);
flex-basis: 100%;
width: 100%;
}
.operation-node {
display: flex;
align-items: center;
gap: var(--spacing-sm);
padding: var(--spacing-xs) 0;
font-size: var(--text-xs);
color: var(--color-text-muted);
flex-wrap: wrap;
}
.operation-node-status {
padding: 2px 6px;
border-radius: var(--radius-md);
font-size: 0.65rem;
font-weight: 600;
text-transform: uppercase;
letter-spacing: 0.025em;
white-space: nowrap;
}
.operation-node-status-success {
background: var(--color-success-light);
color: var(--color-success);
}
.operation-node-status-error {
background: var(--color-error-light);
color: var(--color-error);
}
.operation-node-status-queued {
background: var(--color-bg-tertiary);
color: var(--color-text-muted);
}
.operation-node-status-running_on_worker {
background: var(--color-warning-light);
color: var(--color-warning);
}
.operation-node-status-downloading {
background: var(--color-primary-light);
color: var(--color-primary);
}
.operation-node-name {
font-weight: 500;
color: var(--color-text-secondary);
}
.operation-node-file {
font-family: var(--font-mono);
color: var(--color-text-tertiary);
overflow: hidden;
text-overflow: ellipsis;
max-width: 30ch;
white-space: nowrap;
}
.operation-node-bytes {
font-variant-numeric: tabular-nums;
color: var(--color-text-tertiary);
}
.operation-node-pct {
font-variant-numeric: tabular-nums;
color: var(--color-primary);
font-weight: 500;
}
.operation-node-error {
color: var(--color-error);
}
.operation-node-bar-container {
flex-basis: 100%;
height: 3px;
background: var(--color-surface-sunken);
border-radius: var(--radius-full);
overflow: hidden;
margin-top: 0.25rem;
}
.operation-node-bar {
height: 100%;
background: var(--color-primary);
border-radius: var(--radius-full);
transition: width var(--duration-slow, 0.3s) var(--ease-spring, ease);
}
/* Toast */
.toast-container {
position: fixed;

View File

@@ -1,7 +1,7 @@
import { useState, useMemo, useEffect, useRef } from 'react'
import Modal from './Modal'
import SearchableSelect from './SearchableSelect'
import { nodesApi } from '../utils/api'
import { nodesApi, backendsApi } from '../utils/api'
// NodeInstallPicker is the single multi-node install surface used both from
// the Backends gallery split-button and from the "Install on more nodes" `+`
@@ -240,6 +240,37 @@ export default function NodeInstallPicker({
}
const clearSelection = () => setSelected(new Set())
// pollJob resolves with { done: true, error?: string } once a single job
// completes, fails, or is cancelled. Bounded by a hard wall-clock cap so a
// stuck worker eventually surfaces in the UI as "Failed" instead of
// spinning forever.
const pollJob = (jobID) => new Promise((resolve) => {
const POLL_INTERVAL_MS = 1500
const HARD_CAP_MS = 6 * 60 * 1000 // 6 min - generous for a fresh worker download
const startedAt = Date.now()
const tick = async () => {
try {
const status = await backendsApi.getJob(jobID)
if (status?.completed) { resolve({ done: true }); return }
if (status?.error) { resolve({ done: true, error: status.error }); return }
if (status?.processed && !status?.completed) {
resolve({ done: true, error: status.error || 'install did not complete' })
return
}
} catch (err) {
resolve({ done: true, error: err?.message || 'polling failed' })
return
}
if (Date.now() - startedAt > HARD_CAP_MS) {
resolve({ done: true, error: 'timed out waiting for install to finish' })
return
}
setTimeout(tick, POLL_INTERVAL_MS)
}
tick()
})
const submit = async () => {
if (selected.size === 0 || submitting) return
if (counts.overrides > 0 && !showMismatchConfirm) {
@@ -255,38 +286,68 @@ export default function NodeInstallPicker({
return next
})
const results = await Promise.allSettled(ids.map(id =>
// Phase 1: dispatch all installs in parallel. Each POST returns immediately
// with { jobID } now that the handler is async.
const dispatchResults = await Promise.allSettled(ids.map(id =>
nodesApi.installBackend(id, effectiveBackendName)
.then(r => ({ id, ok: true, message: r?.message }))
.catch(err => ({ id, ok: false, error: err?.message || 'install failed' }))
.then(r => ({ id, ok: true, jobID: r?.jobID }))
.catch(err => ({ id, ok: false, error: err?.message || 'dispatch failed' }))
))
let successCount = 0, failCount = 0
setPerNode(prev => {
const next = { ...prev }
for (const r of results) {
if (r.status !== 'fulfilled') continue
const v = r.value
if (v.ok) {
next[v.id] = { status: 'done' }
successCount++
} else {
next[v.id] = { status: 'error', error: v.error }
failCount++
}
// Classify dispatch results synchronously OUTSIDE the setter. React may
// invoke a functional state updater more than once (StrictMode dev double
// invoke, concurrent rendering replay): building the jobs array inside
// the closure would duplicate entries and re-poll the same job.
const jobs = []
const dispatchPatch = {}
for (const r of dispatchResults) {
if (r.status !== 'fulfilled') continue
const v = r.value
if (v.ok && v.jobID) {
dispatchPatch[v.id] = { status: 'installing', jobID: v.jobID }
jobs.push({ nodeID: v.id, jobID: v.jobID })
} else {
dispatchPatch[v.id] = { status: 'error', error: v.error || 'dispatch failed' }
}
return next
}
setPerNode(prev => ({ ...prev, ...dispatchPatch }))
// Phase 2: poll each job. Promise.all resolves when the last job settles;
// each row flips to done/error as soon as its own pollJob call resolves, via
// the setPerNode that follows it.
await Promise.all(jobs.map(async ({ nodeID, jobID }) => {
const result = await pollJob(jobID)
setPerNode(prev => {
const next = { ...prev }
if (result.error) {
next[nodeID] = { status: 'error', error: result.error, jobID }
} else {
next[nodeID] = { status: 'done', jobID }
}
return next
})
}))
// Phase 3: summary toast + onComplete. Read latest state via functional setter.
let successCount = 0
let failCount = 0
setPerNode(prev => {
for (const v of Object.values(prev)) {
if (v.status === 'done') successCount++
else if (v.status === 'error') failCount++
}
return prev
})
setSubmitting(false)
if (successCount > 0 && onComplete) onComplete()
if (failCount === 0) {
if (failCount === 0 && successCount > 0) {
addToast?.(`Installed on ${successCount} node${successCount === 1 ? '' : 's'}`, 'success')
setTimeout(() => onClose?.(), 800)
} else if (successCount === 0) {
} else if (successCount === 0 && failCount > 0) {
addToast?.(`Install failed on all ${failCount} node${failCount === 1 ? '' : 's'}`, 'error')
} else {
} else if (successCount > 0 && failCount > 0) {
addToast?.(`Installed on ${successCount}, failed on ${failCount}`, 'warning')
}
}
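A note on the summary step in this hunk: `successCount` and `failCount` are incremented inside a `setPerNode` updater and read immediately afterwards. React's documentation describes updater functions as queued and run during a later render, so the counters may not be populated yet when the toast branches are evaluated. A minimal sketch of one alternative, assuming each poll callback returns its outcome so the tally can be computed from plain data; `tallyOutcomes` and `summaryToast` are hypothetical names, not part of this change.

```javascript
// Hypothetical helpers, not part of the diff: tally outcomes from plain data
// instead of from inside a state updater.

// outcomes: [{ nodeID, error? }], one entry per settled poll plus one per
// dispatch failure recorded before polling started.
function tallyOutcomes(outcomes) {
  let successCount = 0
  let failCount = 0
  for (const o of outcomes) {
    if (o.error) failCount++
    else successCount++
  }
  return { successCount, failCount }
}

// Mirrors the three toast branches in submit(); returns null when there is
// nothing to report (no successes and no failures).
function summaryToast({ successCount, failCount }) {
  const plural = (n) => (n === 1 ? '' : 's')
  if (successCount > 0 && failCount === 0) {
    return { level: 'success', message: `Installed on ${successCount} node${plural(successCount)}` }
  }
  if (successCount === 0 && failCount > 0) {
    return { level: 'error', message: `Install failed on all ${failCount} node${plural(failCount)}` }
  }
  if (successCount > 0 && failCount > 0) {
    return { level: 'warning', message: `Installed on ${successCount}, failed on ${failCount}` }
  }
  return null
}
```

With this shape, the `Promise.all` over the jobs would collect `{ nodeID, error }` objects and the toast decision would no longer depend on when React chooses to run the updater.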
@@ -297,32 +358,58 @@ export default function NodeInstallPicker({
.map(([id]) => id)
if (failedIds.length === 0) return
setSelected(new Set(failedIds))
// Replace state for failed rows so they show "installing" again, not stale errors.
setPerNode(prev => {
const next = { ...prev }
failedIds.forEach(id => { next[id] = { status: 'installing' } })
return next
})
setSubmitting(true)
const results = await Promise.allSettled(failedIds.map(id =>
const dispatchResults = await Promise.allSettled(failedIds.map(id =>
nodesApi.installBackend(id, effectiveBackendName)
.then(r => ({ id, ok: true, message: r?.message }))
.catch(err => ({ id, ok: false, error: err?.message || 'install failed' }))
.then(r => ({ id, ok: true, jobID: r?.jobID }))
.catch(err => ({ id, ok: false, error: err?.message || 'dispatch failed' }))
))
// Same precaution as in submit(): classify outside the functional setter
// so a replayed updater can't push duplicate jobs into the polling list.
const jobs = []
const dispatchPatch = {}
for (const r of dispatchResults) {
if (r.status !== 'fulfilled') continue
const v = r.value
if (v.ok && v.jobID) {
dispatchPatch[v.id] = { status: 'installing', jobID: v.jobID }
jobs.push({ nodeID: v.id, jobID: v.jobID })
} else {
dispatchPatch[v.id] = { status: 'error', error: v.error || 'dispatch failed' }
}
}
setPerNode(prev => ({ ...prev, ...dispatchPatch }))
await Promise.all(jobs.map(async ({ nodeID, jobID }) => {
const result = await pollJob(jobID)
setPerNode(prev => {
const next = { ...prev }
if (result.error) next[nodeID] = { status: 'error', error: result.error, jobID }
else next[nodeID] = { status: 'done', jobID }
return next
})
}))
setSubmitting(false)
let successCount = 0, failCount = 0
setPerNode(prev => {
const next = { ...prev }
for (const r of results) {
if (r.status !== 'fulfilled') continue
const v = r.value
if (v.ok) { next[v.id] = { status: 'done' }; successCount++ }
else { next[v.id] = { status: 'error', error: v.error }; failCount++ }
for (const id of failedIds) {
const v = prev[id]
if (v?.status === 'done') successCount++
else if (v?.status === 'error') failCount++
}
return next
return prev
})
setSubmitting(false)
if (successCount > 0 && onComplete) onComplete()
if (failCount === 0) {
if (failCount === 0 && successCount > 0) {
addToast?.(`Installed on ${successCount} node${successCount === 1 ? '' : 's'}`, 'success')
setTimeout(() => onClose?.(), 800)
}
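The `pollJob` helper added earlier in this file makes one decision per tick: resolve with success, resolve with an error, or schedule another tick. A minimal sketch of that decision as a pure function, which would make the ordering of the checks easy to unit test; `decideTick` is a hypothetical name, and the status fields (`completed`, `error`, `processed`) are the ones read in the diff.

```javascript
// Hypothetical pure version of the decision made on every pollJob tick.
// Returns null to keep polling, or the object pollJob would resolve with.
function decideTick(status, elapsedMs, hardCapMs) {
  if (status?.completed) return { done: true }
  if (status?.error) return { done: true, error: status.error }
  // Processed without completing and without an explicit error message.
  if (status?.processed && !status?.completed) {
    return { done: true, error: 'install did not complete' }
  }
  // The wall-clock cap is checked last, after the status has been inspected.
  if (elapsedMs > hardCapMs) {
    return { done: true, error: 'timed out waiting for install to finish' }
  }
  return null
}
```

A polling loop would call this with `Date.now() - startedAt` and re-arm its timer only when the result is `null`.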

View File

@@ -1,14 +1,33 @@
import { useState } from 'react'
import { useOperations } from '../hooks/useOperations'
const nodeStatusLabels = {
success: 'Done',
error: 'Failed',
queued: 'Queued',
running_on_worker: 'Worker busy',
downloading: 'Downloading',
}
const runningOnWorkerTooltip = 'NATS round-trip timed out, but the worker is still installing in the background. The reconciler will confirm completion.'
export default function OperationsBar() {
const { operations, cancelOperation, dismissFailedOp } = useOperations()
const [expanded, setExpanded] = useState({})
if (operations.length === 0) return null
const toggle = (key) => setExpanded((m) => ({ ...m, [key]: !m[key] }))
return (
<div className="operations-bar">
{operations.map(op => (
<div key={op.jobID || op.id} className="operation-item">
{operations.map(op => {
const key = op.jobID || op.id
const nodes = Array.isArray(op.nodes) ? op.nodes : []
const canExpand = nodes.length > 1
const isOpen = !!expanded[key]
return (
<div key={key} className="operation-item">
<div className="operation-info">
{op.error ? (
<i className="fas fa-circle-exclamation" style={{ color: 'var(--color-error)', marginRight: 'var(--spacing-xs)' }} />
@@ -80,8 +99,55 @@ export default function OperationsBar() {
<i className="fas fa-xmark" />
</button>
) : null}
{canExpand && (
<button
type="button"
className="operation-expand"
onClick={() => toggle(key)}
aria-expanded={isOpen}
title={isOpen ? 'Hide per-node detail' : `Show ${nodes.length} nodes`}
>
<i className={`fas fa-chevron-${isOpen ? 'up' : 'down'}`} />
<span className="operation-expand-label">{nodes.length} nodes</span>
</button>
)}
{canExpand && isOpen && (
<ul className="operation-nodes-list">
{nodes.map((n) => (
<li key={n.node_id} className={`operation-node operation-node-${n.status}`}>
<span
className={`operation-node-status operation-node-status-${n.status}`}
title={n.status === 'running_on_worker' ? runningOnWorkerTooltip : undefined}
>
{nodeStatusLabels[n.status] || n.status}
</span>
<span className="operation-node-name">{n.node_name || n.node_id}</span>
{n.file_name && <span className="operation-node-file">{n.file_name}</span>}
{(n.current || n.total) && (
<span className="operation-node-bytes">
{n.current || '?'} / {n.total || '?'}
</span>
)}
{n.percentage > 0 && (
<span className="operation-node-pct">{Math.round(n.percentage)}%</span>
)}
{n.error && (
<span className="operation-node-error" title={n.error}>
{n.error.length > 80 ? n.error.slice(0, 80) + '...' : n.error}
</span>
)}
{n.percentage > 0 && n.percentage < 100 && (
<div className="operation-node-bar-container">
<div className="operation-node-bar" style={{ width: `${n.percentage}%` }} />
</div>
)}
</li>
))}
</ul>
)}
</div>
))}
)
})}
</div>
)
}
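The per-node row in `OperationsBar` derives several display values from one node entry: the status label, the byte counter, the rounded percentage, the truncated error, and whether the progress bar is drawn. A minimal sketch of those rules as a plain function, assuming the same field names as the JSX; `describeNodeRow` is a hypothetical helper and not part of this change.

```javascript
// Labels copied from nodeStatusLabels in the diff.
const NODE_STATUS_LABELS = {
  success: 'Done',
  error: 'Failed',
  queued: 'Queued',
  running_on_worker: 'Worker busy',
  downloading: 'Downloading',
}

// Hypothetical helper: derive what one per-node row displays from the node
// entry `n`, following the same conditions as the JSX.
function describeNodeRow(n, maxErrorLen = 80) {
  return {
    // Unknown statuses fall back to the raw status string.
    label: NODE_STATUS_LABELS[n.status] || n.status,
    name: n.node_name || n.node_id,
    bytes: n.current || n.total ? `${n.current || '?'} / ${n.total || '?'}` : null,
    pct: n.percentage > 0 ? Math.round(n.percentage) : null,
    // The bar is only drawn while progress is strictly between 0 and 100.
    showBar: n.percentage > 0 && n.percentage < 100,
    error: n.error
      ? (n.error.length > maxErrorLen ? n.error.slice(0, maxErrorLen) + '...' : n.error)
      : null,
  }
}
```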

View File

@@ -1,9 +1,10 @@
import { useState, useEffect, useCallback, useRef, useMemo } from 'react'
import { useParams, useSearchParams, useOutletContext, Link } from 'react-router-dom'
import { backendLogsApi } from '../utils/api'
import { useParams, useSearchParams, useOutletContext, Link, Navigate } from 'react-router-dom'
import { backendLogsApi, nodesApi } from '../utils/api'
import { formatTimestamp } from '../utils/format'
import { apiUrl } from '../utils/basePath'
import LoadingSpinner from '../components/LoadingSpinner'
import { useDistributedMode } from '../hooks/useDistributedMode'
function wsUrl(path) {
const proto = window.location.protocol === 'https:' ? 'wss:' : 'ws:'
@@ -274,11 +275,158 @@ function BackendLogsDetail({ modelId }) {
)
}
// DistributedBackendLogsResolver runs only in distributed mode. The local
// /api/backend-logs WebSocket has no backend behind it here (inference lives
// on workers), so we resolve modelId → hosting node(s) and forward to the
// per-node logs page. One hit redirects automatically; multiple hits render
// a picker so the operator can pick which worker's logs to inspect.
function DistributedBackendLogsResolver({ modelId, fromTimestamp }) {
const [hits, setHits] = useState(null) // [{ node, model }] once resolved
const [error, setError] = useState(null)
useEffect(() => {
let cancelled = false
;(async () => {
try {
const nodes = await nodesApi.list()
const nodeList = Array.isArray(nodes) ? nodes : []
// Fan out to each node and collect entries that match this model.
// Per-node failures are tolerated — a single offline worker shouldn't
// hide logs available on its peers.
const perNode = await Promise.all(nodeList.map(async (node) => {
try {
const models = await nodesApi.getModels(node.id)
const matches = (Array.isArray(models) ? models : []).filter(m => m.model_name === modelId)
return matches.map(m => ({ node, model: m }))
} catch {
return []
}
}))
if (cancelled) return
setHits(perNode.flat())
} catch (err) {
if (!cancelled) setError(err)
}
})()
return () => { cancelled = true }
}, [modelId])
if (error) {
return (
<div className="page page--wide">
<div className="empty-state">
<div className="empty-state-icon"><i className="fas fa-exclamation-triangle" /></div>
<h2 className="empty-state-title">Failed to resolve hosting nodes</h2>
<p className="empty-state-text">{error.message}</p>
</div>
</div>
)
}
if (hits === null) {
return (
<div style={{ display: 'flex', justifyContent: 'center', padding: 'var(--spacing-xl)' }}>
<LoadingSpinner size="lg" />
</div>
)
}
if (hits.length === 0) {
return (
<div className="page page--wide">
<div className="empty-state">
<div className="empty-state-icon"><i className="fas fa-terminal" /></div>
<h2 className="empty-state-title">Model not loaded on any worker</h2>
<p className="empty-state-text">
<span style={{ fontFamily: 'var(--font-mono)' }}>{modelId}</span> isn't currently loaded on any node in the cluster.
Check the <Link to="/app/nodes" style={{ color: 'var(--color-primary)' }}>Nodes page</Link> to see which models are running where.
</p>
</div>
</div>
)
}
// Bare model name aggregates this node's replicas via the worker's log
// store; preserve ?from= so the deep-link from a trace still scrolls to
// the right line on arrival.
const buildHref = (nodeId) => {
const base = `/app/node-backend-logs/${nodeId}/${encodeURIComponent(modelId)}`
return fromTimestamp ? `${base}?from=${encodeURIComponent(fromTimestamp)}` : base
}
if (hits.length === 1) {
return <Navigate to={buildHref(hits[0].node.id)} replace />
}
// Multiple workers host this model — let the operator pick.
return (
<div className="page page--wide">
<div className="page-header">
<div>
<h1 className="page-title" style={{ marginBottom: 0 }}>
<i className="fas fa-terminal" style={{ fontSize: '0.8em', marginRight: 'var(--spacing-sm)' }} />
{modelId}
</h1>
<p className="page-subtitle" style={{ marginTop: 'var(--spacing-xs)' }}>
Hosted on {hits.length} workers — pick one to view its logs.
</p>
</div>
</div>
<div style={{ display: 'flex', flexDirection: 'column', gap: 'var(--spacing-xs)' }}>
{hits.map(({ node, model }) => (
<Link
key={`${node.id}#${model.replica_index ?? 0}`}
to={buildHref(node.id)}
style={{
display: 'flex', alignItems: 'center', justifyContent: 'space-between',
padding: 'var(--spacing-sm) var(--spacing-md)',
background: 'var(--color-bg-primary)', border: '1px solid var(--color-border)',
borderRadius: 'var(--radius-md)', textDecoration: 'none', color: 'inherit',
}}
>
<div>
<div style={{ fontWeight: 500 }}>{node.name || node.id}</div>
<div style={{ fontSize: '0.75rem', color: 'var(--color-text-secondary)', fontFamily: 'var(--font-mono)' }}>
{node.id}{model.replica_index ? ` · replica ${model.replica_index}` : ''} · {model.state}
</div>
</div>
<i className="fas fa-chevron-right" style={{ color: 'var(--color-text-muted)' }} />
</Link>
))}
</div>
</div>
)
}
// BackendLogsRouter picks between the local WebSocket view (standalone) and
// the distributed resolver. The probe runs once via useDistributedMode so a
// 503 from /api/nodes (the canonical "distributed disabled" signal) keeps the
// existing standalone path intact.
function BackendLogsRouter({ modelId }) {
const [searchParams] = useSearchParams()
const fromTimestamp = searchParams.get('from')
const { enabled: distributedMode, loading } = useDistributedMode()
if (loading) {
return (
<div style={{ display: 'flex', justifyContent: 'center', padding: 'var(--spacing-xl)' }}>
<LoadingSpinner size="lg" />
</div>
)
}
if (distributedMode) {
return <DistributedBackendLogsResolver modelId={modelId} fromTimestamp={fromTimestamp} />
}
return <BackendLogsDetail modelId={modelId} />
}
export default function BackendLogs() {
const { modelId } = useParams()
if (modelId) {
return <BackendLogsDetail modelId={decodeURIComponent(modelId)} />
return <BackendLogsRouter modelId={decodeURIComponent(modelId)} />
}
// No model specified — redirect to System page
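The resolver added in this file branches on how many workers host the model: none, exactly one (redirect), or several (picker). A minimal sketch of that branching as a pure function, assuming the same `{ node, model }` hit shape and the same URL layout as `buildHref` in the diff; `buildNodeLogsHref` and `decideLogsView` are hypothetical names.

```javascript
// Same URL shape as buildHref in the diff; `from` is only appended when set.
function buildNodeLogsHref(nodeId, modelId, fromTimestamp) {
  const base = `/app/node-backend-logs/${nodeId}/${encodeURIComponent(modelId)}`
  return fromTimestamp ? `${base}?from=${encodeURIComponent(fromTimestamp)}` : base
}

// Hypothetical pure form of the resolver's branching. `hits` is null while
// the fan-out is still in flight, otherwise an array of { node, model }.
function decideLogsView(hits, modelId, fromTimestamp) {
  if (hits === null) return { kind: 'loading' }
  if (hits.length === 0) return { kind: 'not-loaded' }
  if (hits.length === 1) {
    return { kind: 'redirect', to: buildNodeLogsHref(hits[0].node.id, modelId, fromTimestamp) }
  }
  return {
    kind: 'picker',
    links: hits.map(({ node }) => buildNodeLogsHref(node.id, modelId, fromTimestamp)),
  }
}
```

Keeping the `?from=` parameter on every generated link is what lets a deep link from a trace land on the right log line, as the comment above `buildHref` describes.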

View File

@@ -179,16 +179,19 @@ export default function Backends() {
// Install a single gallery backend on a specific node, used in target-node
// mode (the URL has ?target=<node-id> set from the Nodes page entry point).
// The handler is async - we dispatch and let the global Operations panel
// surface progress; no need to await completion here.
const handleInstallOnTarget = async (id) => {
if (!targetNode) return
try {
await nodesApi.installBackend(targetNode.id, id)
addToast(`Installing ${id} on ${targetNode.name}`, 'info')
// Per-node install is request-reply, not part of the global jobs feed —
// refetch to reflect the new Nodes column state.
setTimeout(() => { fetchBackends(); refetchNodes() }, 600)
addToast(`Installing ${id} on ${targetNode.name}...`, 'info')
// The install runs async via the gallery job queue. Refetch shortly so
// the Nodes column reflects "installing" state; the Operations panel
// tracks the actual progress until completion.
setTimeout(() => { fetchBackends(); refetchNodes() }, 1200)
} catch (err) {
addToast(`Install failed on ${targetNode.name}: ${err.message}`, 'error')
addToast(`Install dispatch failed on ${targetNode.name}: ${err.message}`, 'error')
}
}

View File

@@ -660,8 +660,7 @@ export default function Manage() {
{ key: 'edit', icon: 'fa-pen-to-square', label: 'Edit configuration',
onClick: () => navigate(`/app/model-editor/${encodeURIComponent(model.id)}`) },
{ key: 'logs', icon: 'fa-terminal', label: 'Backend logs',
onClick: () => navigate(`/app/backend-logs/${encodeURIComponent(model.id)}`),
hidden: distributedMode },
onClick: () => navigate(`/app/backend-logs/${encodeURIComponent(model.id)}`) },
{ divider: true },
{ key: 'delete', icon: 'fa-trash', label: 'Delete model', danger: true,
onClick: () => handleDeleteModel(model.id) },

View File

@@ -435,6 +435,9 @@ export default function Settings() {
<SettingRow label="Max Items" description="Maximum number of trace items to retain (0 = unlimited)">
<input className="input" type="number" style={{ width: 120 }} value={settings.tracing_max_items ?? ''} onChange={(e) => update('tracing_max_items', parseInt(e.target.value) || 0)} placeholder="100" disabled={!settings.enable_tracing} />
</SettingRow>
<SettingRow label="Max Body Bytes" description="Per-field cap (bytes) for captured request/response bodies and backend trace Data fields. Prevents large LLM histories or TTS audio snippets from locking the Traces UI. 0 = uncapped.">
<input className="input" type="number" style={{ width: 120 }} value={settings.tracing_max_body_bytes ?? ''} onChange={(e) => update('tracing_max_body_bytes', parseInt(e.target.value) || 0)} placeholder="65536" disabled={!settings.enable_tracing} />
</SettingRow>
<SettingRow label="Enable Backend Logging" description="Capture backend process output per model (without requiring debug mode)">
<Toggle checked={settings.enable_backend_logging} onChange={(v) => update('enable_backend_logging', v)} />
</SettingRow>
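The numeric inputs in this hunk store `parseInt(e.target.value) || 0`. A minimal sketch of what that expression yields for typical input strings, written with an explicit radix; `parseCap` is a hypothetical helper and not part of this change.

```javascript
// Hypothetical helper mirroring `parseInt(e.target.value) || 0` from the
// diff, with an explicit radix. Anything that does not parse to a non-zero
// integer collapses to 0, which the setting's description treats as "uncapped".
function parseCap(raw) {
  return parseInt(raw, 10) || 0
}
```

An empty field therefore saves as 0 (uncapped) rather than as the placeholder value, and a negative entry passes through unchanged, so any clamping would have to happen elsewhere; whether the server does so is not shown here.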

View File

@@ -220,7 +220,10 @@ function BackendTraceDetail({ trace }) {
</div>
)}
{/* Backend logs link */}
{/* Backend logs link — /app/backend-logs/:modelId is the unified entry
point: in standalone mode it streams local logs, in distributed mode
it resolves the model to the host worker(s) and either redirects to
/app/node-backend-logs/<nodeId>/<modelId> or shows a node picker. */}
{trace.model_name && (
<div style={{ marginBottom: 'var(--spacing-md)' }}>
<a
@@ -406,7 +409,15 @@ export default function Traces() {
<button className="btn btn-secondary btn-sm" onClick={fetchTraces}><i className="fas fa-rotate" /> Refresh</button>
<button className="btn btn-secondary btn-sm" onClick={handleExport} disabled={traces.length === 0}><i className="fas fa-download" /> Export</button>
<div style={{ flex: 1 }} />
<button className="btn btn-danger btn-sm" onClick={handleClear} disabled={traces.length === 0}><i className="fas fa-trash" /> Clear</button>
<button
className="btn btn-danger btn-sm"
onClick={handleClear}
/* Stay enabled while loading: a massive in-memory trace buffer is
precisely the case where the user can't see the table yet and
needs Clear to recover. Clearing an already-empty server-side
buffer is a harmless no-op. */
disabled={!loading && traces.length === 0}
><i className="fas fa-trash" /> Clear</button>
</div>
{settings && (() => {
@@ -459,6 +470,17 @@ export default function Traces() {
disabled={!settings.enable_tracing}
/>
</SettingRow>
<SettingRow label="Max Body Bytes" description="Per-field cap for captured bodies and backend trace Data (0 = uncapped). Prevents oversized LLM histories or TTS snippets from locking this page in loading.">
<input
className="input"
type="number"
style={{ width: 120 }}
value={settings.tracing_max_body_bytes ?? ''}
onChange={(e) => setSettings(prev => ({ ...prev, tracing_max_body_bytes: parseInt(e.target.value) || 0 }))}
placeholder="65536"
disabled={!settings.enable_tracing}
/>
</SettingRow>
<SettingRow label="Enable Backend Logging" description="Capture backend process output per model (without requiring debug mode)">
<Toggle
checked={settings.enable_backend_logging}

View File

@@ -4,6 +4,7 @@ import { useTranslation } from 'react-i18next'
import { useAuth } from '../context/AuthContext'
import { apiUrl } from '../utils/basePath'
import LoadingSpinner from '../components/LoadingSpinner'
import SourcesTab from './Usage/SourcesTab'
const PERIODS = [
{ key: 'day', label: 'Day' },
@@ -724,23 +725,27 @@ export default function Usage() {
{p.label}
</button>
))}
<div style={{ width: 1, height: 20, background: 'var(--color-border-subtle)', margin: '0 var(--spacing-xs)' }} />
<button
className={`btn btn-sm ${activeTab === 'models' ? 'btn-primary' : 'btn-secondary'}`}
onClick={() => setActiveTab('models')}
>
<i className="fas fa-cube" style={{ fontSize: '0.7rem' }} /> Models
</button>
{isAdmin && (
<>
<div style={{ width: 1, height: 20, background: 'var(--color-border-subtle)', margin: '0 var(--spacing-xs)' }} />
<button
className={`btn btn-sm ${activeTab === 'models' ? 'btn-primary' : 'btn-secondary'}`}
onClick={() => setActiveTab('models')}
>
<i className="fas fa-cube" style={{ fontSize: '0.7rem' }} /> Models
</button>
<button
className={`btn btn-sm ${activeTab === 'users' ? 'btn-primary' : 'btn-secondary'}`}
onClick={() => setActiveTab('users')}
>
<i className="fas fa-users" style={{ fontSize: '0.7rem' }} /> Users
</button>
</>
<button
className={`btn btn-sm ${activeTab === 'users' ? 'btn-primary' : 'btn-secondary'}`}
onClick={() => setActiveTab('users')}
>
<i className="fas fa-users" style={{ fontSize: '0.7rem' }} /> Users
</button>
)}
<button
className={`btn btn-sm ${activeTab === 'sources' ? 'btn-primary' : 'btn-secondary'}`}
onClick={() => setActiveTab('sources')}
>
<i className="fas fa-key" style={{ fontSize: '0.7rem' }} /> {t('usage.sources.tab')}
</button>
<div style={{ flex: 1 }} />
<button className="btn btn-secondary btn-sm" onClick={fetchUsage} disabled={loading} style={{ gap: 4 }}>
<i className={`fas fa-rotate${loading ? ' fa-spin' : ''}`} /> Refresh
@@ -884,6 +889,10 @@ export default function Usage() {
</div>
)
)}
{activeTab === 'sources' && (
<SourcesTab period={period} adminUserId={selectedUserId} />
)}
</>
)}
</div>

View File

@@ -0,0 +1,83 @@
import { useTranslation } from 'react-i18next'
const SEGMENT_COLORS = {
apikey: 'var(--color-primary)',
web: 'var(--color-info, #3b82f6)',
legacy: 'var(--color-warning, #f59e0b)',
}
// SourceMixRibbon renders one segmented horizontal bar showing the share of
// tokens by source class (apikey / web / legacy). Clicking a segment invokes
// onSelectSourceClass with the segment key so the parent can filter the view.
//
// Props:
// bySource: { apikey?: {tokens, requests}, web?: {...}, legacy?: {...} }
// keyCount: number of distinct API keys in the dataset (for the legend)
// onSelectSourceClass: (cls: 'apikey'|'web'|'legacy') => void (optional)
export default function SourceMixRibbon({ bySource = {}, keyCount = 0, onSelectSourceClass }) {
const { t } = useTranslation('admin')
const apikey = (bySource.apikey?.tokens) || 0
const web = (bySource.web?.tokens) || 0
const legacy = (bySource.legacy?.tokens) || 0
const total = apikey + web + legacy || 1
const pct = (n) => Math.round((n / total) * 100)
const apiPct = pct(apikey)
const webPct = pct(web)
const legacyPct = pct(legacy)
const segments = [
{ key: 'apikey', label: `${apiPct}% API keys (${keyCount})`, pct: apiPct, color: SEGMENT_COLORS.apikey },
{ key: 'web', label: `${webPct}% ${t('usage.sources.webUI')}`, pct: webPct, color: SEGMENT_COLORS.web },
{ key: 'legacy', label: `${legacyPct}% ${t('usage.sources.legacy')}`, pct: legacyPct, color: SEGMENT_COLORS.legacy },
].filter((s) => s.pct > 0)
return (
<div
role="group"
aria-label={t('usage.sources.ribbonAria', { apikey: apiPct, web: webPct, legacy: legacyPct })}
style={{ display: 'flex', flexDirection: 'column', gap: 'var(--spacing-xs)' }}
>
<div style={{ fontSize: '0.875rem', fontWeight: 600, color: 'var(--color-text-primary)' }}>
{t('usage.sources.mixTitle')}
</div>
<div
style={{
display: 'flex',
height: 12,
borderRadius: 'var(--radius-sm)',
overflow: 'hidden',
border: '1px solid var(--color-border-subtle)',
}}
>
{segments.map((s) => (
<button
key={s.key}
type="button"
onClick={() => onSelectSourceClass?.(s.key)}
aria-label={s.label}
style={{
width: `${s.pct}%`,
background: s.color,
border: 'none',
padding: 0,
cursor: onSelectSourceClass ? 'pointer' : 'default',
}}
/>
))}
</div>
<div style={{ display: 'flex', flexWrap: 'wrap', gap: 'var(--spacing-sm)', fontSize: '0.75rem' }}>
{segments.map((s) => (
<span key={s.key} style={{ display: 'inline-flex', alignItems: 'center', gap: 6 }}>
<span
style={{ width: 10, height: 10, borderRadius: 2, background: s.color, display: 'inline-block' }}
aria-hidden
/>
{s.label}
</span>
))}
</div>
</div>
)
}
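The ribbon's percentages come from a small amount of arithmetic: sum the token counts, fall back to 1 for an empty dataset, round each share, and drop segments that round to zero. A minimal sketch of that math as a standalone function, assuming the same `bySource` shape as the component's props; `sourceShares` is a hypothetical name.

```javascript
// Hypothetical pure version of the percentage math in SourceMixRibbon.
function sourceShares(bySource = {}) {
  const apikey = bySource.apikey?.tokens || 0
  const web = bySource.web?.tokens || 0
  const legacy = bySource.legacy?.tokens || 0
  // Fall back to 1 so an empty dataset divides cleanly to 0% everywhere.
  const total = apikey + web + legacy || 1
  const pct = (n) => Math.round((n / total) * 100)
  return [
    { key: 'apikey', pct: pct(apikey) },
    { key: 'web', pct: pct(web) },
    { key: 'legacy', pct: pct(legacy) },
  ].filter((s) => s.pct > 0)
}
```

Because each share is rounded independently, the displayed percentages can sum to 99 or 101, and a source below 0.5% is filtered out of both the bar and the legend.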

View File

@@ -0,0 +1,147 @@
import { useMemo } from 'react'
import { useTranslation } from 'react-i18next'
const TOP_N = 7
// Distinct, accessible-ish series colors that read on both light and dark themes.
const SERIES_COLORS = [
'var(--color-primary)',
'var(--color-success, #10b981)',
'var(--color-warning, #f59e0b)',
'var(--color-info, #3b82f6)',
'var(--color-danger, #ef4444)',
'#a855f7',
'#ec4899',
]
const OTHER_COLOR = 'var(--color-text-muted, #94a3b8)'
function identityFor(bucket) {
return bucket.api_key_id || bucket.source || 'unknown'
}
// buckets: UsageBucket[] from /api/auth/usage/sources (server-sorted ASC by bucket)
// selectedKey: 'web' | 'legacy' | api_key_id | null
// totals: SourceTotals (for the "Other (count)" legend label)
export default function SourceTimeChart({ buckets = [], selectedKey, totals }) {
const { t } = useTranslation('admin')
// Find the top-N identities by total tokens across the period.
const topIds = useMemo(() => {
const sums = new Map()
for (const b of buckets) {
const id = identityFor(b)
sums.set(id, (sums.get(id) || 0) + (b.total_tokens || 0))
}
return [...sums.entries()]
.sort((a, b) => b[1] - a[1])
.slice(0, TOP_N)
.map(([id]) => id)
}, [buckets])
const topSet = useMemo(() => new Set(topIds), [topIds])
// Resolve a display label for an identity (api_key_id -> snapshotted name, or source name).
const labelByIdentity = useMemo(() => {
const m = new Map()
for (const b of buckets) {
const id = identityFor(b)
if (m.has(id)) continue
if (b.source === 'web') { m.set(id, t('usage.sources.webUI')); continue }
if (b.source === 'legacy') { m.set(id, t('usage.sources.legacy')); continue }
m.set(id, b.api_key_name || b.api_key_id || id)
}
return m
}, [buckets, t])
// Build a dense per-bucket row, splitting top-N vs Other.
const series = useMemo(() => {
const byBucket = new Map()
for (const b of buckets) {
const id = identityFor(b)
const seriesId = topSet.has(id) ? id : '__other__'
const row = byBucket.get(b.bucket) || { bucket: b.bucket, total: 0 }
row[seriesId] = (row[seriesId] || 0) + (b.total_tokens || 0)
row.total += b.total_tokens || 0
byBucket.set(b.bucket, row)
}
return [...byBucket.values()]
}, [buckets, topSet])
const max = useMemo(
() => series.reduce((m, r) => Math.max(m, r.total), 0) || 1,
[series]
)
const seriesIds = [...topIds, '__other__']
const colorOf = (id) =>
id === '__other__'
? OTHER_COLOR
: SERIES_COLORS[topIds.indexOf(id) % SERIES_COLORS.length]
const labelOfId = (id) => {
if (id === '__other__') return null // computed inline (need count)
return labelByIdentity.get(id) || id
}
const otherCount = Math.max(0, (totals?.by_key?.length || 0) - TOP_N)
// SVG geometry: each bar occupies a 24-unit slot (20 wide plus a 4-unit gap, 2 on each side), 100 units tall; the viewBox stretches with bar count.
const barWidth = 20
const barGap = 4
const slotWidth = barWidth + barGap
const height = 100
const width = Math.max(series.length * slotWidth, 200)
return (
<div style={{ display: 'flex', flexDirection: 'column', gap: 'var(--spacing-xs)' }}>
<div style={{ fontSize: '0.875rem', fontWeight: 600, color: 'var(--color-text-primary)' }}>
{t('usage.sources.topSources')}
</div>
<svg
viewBox={`0 0 ${width} ${height}`}
preserveAspectRatio="none"
style={{ width: '100%', height: 160, display: 'block' }}
aria-hidden
>
{series.map((row, i) => {
let y = height
return (
<g key={row.bucket} transform={`translate(${i * slotWidth}, 0)`}>
{seriesIds.map(id => {
const v = row[id] || 0
if (!v) return null
const h = (v / max) * height
y -= h
const dim = selectedKey && selectedKey !== id ? 0.25 : 1
const title = id === '__other__'
? t('usage.sources.other', { count: otherCount })
: labelOfId(id)
return (
<rect
key={id}
x={barGap / 2} y={y}
width={barWidth} height={h}
fill={colorOf(id)} opacity={dim}
>
<title>{`${row.bucket} - ${title}: ${v.toLocaleString()}`}</title>
</rect>
)
})}
</g>
)
})}
</svg>
<div style={{ display: 'flex', flexWrap: 'wrap', gap: 'var(--spacing-sm)', fontSize: '0.75rem' }}>
{seriesIds.map(id => (
<span key={id} style={{ display: 'inline-flex', alignItems: 'center', gap: 6 }}>
<span style={{ width: 10, height: 10, borderRadius: 2, background: colorOf(id), display: 'inline-block' }} aria-hidden />
{id === '__other__'
? t('usage.sources.other', { count: otherCount })
: labelOfId(id)}
</span>
))}
</div>
</div>
)
}
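The chart's data preparation runs in two passes: rank identities by total tokens to find the top N, then fold every other identity into a single `__other__` series per time bucket. A minimal sketch of both passes without React, assuming the same bucket fields (`bucket`, `api_key_id`, `source`, `total_tokens`) as the diff; `groupTopN` is a hypothetical name.

```javascript
// Same identity rule as identityFor in the diff.
function identityFor(bucket) {
  return bucket.api_key_id || bucket.source || 'unknown'
}

// Hypothetical pure version of the two grouping passes in SourceTimeChart.
function groupTopN(buckets, topN) {
  // Pass 1: total tokens per identity, then keep the N largest.
  const sums = new Map()
  for (const b of buckets) {
    const id = identityFor(b)
    sums.set(id, (sums.get(id) || 0) + (b.total_tokens || 0))
  }
  const topIds = [...sums.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, topN)
    .map(([id]) => id)
  const topSet = new Set(topIds)

  // Pass 2: one row per time bucket; non-top identities share '__other__'.
  const byBucket = new Map()
  for (const b of buckets) {
    const id = identityFor(b)
    const seriesId = topSet.has(id) ? id : '__other__'
    const row = byBucket.get(b.bucket) || { bucket: b.bucket, total: 0 }
    row[seriesId] = (row[seriesId] || 0) + (b.total_tokens || 0)
    row.total += b.total_tokens || 0
    byBucket.set(b.bucket, row)
  }
  return { topIds, rows: [...byBucket.values()] }
}
```

Rows come out in first-seen bucket order, so the result relies on the input already being sorted by bucket, as the comment above the component states for the server response.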

View File

@@ -0,0 +1,176 @@
import { useEffect, useState } from 'react'
import { useTranslation } from 'react-i18next'
import { usageApi, apiKeysApi } from '../../utils/api'
import { useAuth } from '../../context/AuthContext'
import LoadingSpinner from '../../components/LoadingSpinner'
import SourceMixRibbon from './SourceMixRibbon'
import SourcesTable from './SourcesTable'
import SourceTimeChart from './SourceTimeChart'
const EMPTY_DATA = {
buckets: [],
totals: { by_source: {}, by_key: [], grand_total: { tokens: 0, requests: 0 } },
truncated: false,
}
// Resolve a human label for the currently selected key (web/legacy class or api_key_id).
function labelForSelected(totals, selectedKey, t) {
if (!selectedKey) return ''
if (selectedKey === 'web') return t('usage.sources.webUI')
if (selectedKey === 'legacy') return t('usage.sources.legacy')
const row = (totals?.by_key || []).find(k => k.api_key_id === selectedKey)
return row ? (row.api_key_name || selectedKey) : selectedKey
}
// SourcesTab fetches and renders per-source / per-API-key usage breakdown.
// The breakdown is rendered with the SourceMixRibbon, SourcesTable, and
// SourceTimeChart components imported above.
export default function SourcesTab({ period, adminUserId }) {
const { t } = useTranslation('admin')
const { isAdmin } = useAuth()
const [data, setData] = useState(EMPTY_DATA)
const [loading, setLoading] = useState(false)
const [error, setError] = useState(null)
const [selectedKey, setSelectedKey] = useState(null)
const [search, setSearch] = useState('')
const [sortKey, setSortKey] = useState('tokens')
// Pull the current set of API key ids so the table can mark unknown keys as
// revoked. null = "don't know yet" so the table won't dim live keys during
// the fetch or after a failure.
const [existingKeyIds, setExistingKeyIds] = useState(null)
useEffect(() => {
apiKeysApi
.list()
.then((resp) => {
const list = Array.isArray(resp) ? resp : (resp?.keys || [])
setExistingKeyIds(new Set(list.map((k) => k.id)))
})
.catch(() => { /* leave existingKeyIds null so revoked detection is skipped */ })
}, [])
useEffect(() => {
let cancelled = false
setLoading(true)
setError(null)
const p = isAdmin
? usageApi.getAdminSources(period, adminUserId)
: usageApi.getMySources(period)
p
.then((d) => { if (!cancelled) setData(d || EMPTY_DATA) })
.catch((e) => { if (!cancelled) setError(e) })
.finally(() => { if (!cancelled) setLoading(false) })
return () => { cancelled = true }
}, [isAdmin, period, adminUserId])
const totals = data.totals || EMPTY_DATA.totals
const buckets = data.buckets || EMPTY_DATA.buckets
const grandT = totals.grand_total || { tokens: 0, requests: 0 }
const truncated = data.truncated || false
const isEmpty = !loading && (grandT.tokens || 0) === 0 && (grandT.requests || 0) === 0
if (loading) {
return (
<div style={{ display: 'flex', justifyContent: 'center', padding: 'var(--spacing-xl)' }}>
<LoadingSpinner size="lg" />
</div>
)
}
if (error) {
return (
<div className="empty-state">
<div className="empty-state-icon"><i className="fas fa-triangle-exclamation" /></div>
<h2 className="empty-state-title">Failed to load</h2>
<p className="empty-state-text">{String(error.message || error)}</p>
</div>
)
}
if (isEmpty) {
return (
<div className="empty-state">
<div className="empty-state-icon"><i className="fas fa-key" /></div>
<h2 className="empty-state-title">{t('usage.sources.noTrafficShort')}</h2>
<p className="empty-state-text">{t('usage.sources.noKeysYet')}</p>
</div>
)
}
return (
<div style={{ display: 'flex', flexDirection: 'column', gap: 'var(--spacing-md)' }}>
<div className="card" style={{ padding: 'var(--spacing-md)' }}>
<SourceMixRibbon
bySource={totals.by_source}
keyCount={(totals.by_key || []).length}
onSelectSourceClass={(cls) => setSelectedKey(cls)}
/>
</div>
{selectedKey && (
<div style={{ display: 'flex', alignItems: 'center', gap: 'var(--spacing-xs)' }}>
<span
style={{
display: 'inline-flex',
alignItems: 'center',
gap: 'var(--spacing-xs)',
padding: 'calc(var(--spacing-xs) / 2) var(--spacing-sm)',
background: 'var(--color-bg-secondary)',
color: 'var(--color-text-primary)',
fontSize: '0.75rem',
borderRadius: 'var(--radius-sm)',
border: '1px solid var(--color-border-subtle)',
}}
>
<i className="fas fa-filter" style={{ fontSize: '0.6875rem', color: 'var(--color-text-muted)' }} aria-hidden />
{t('usage.sources.filteredTo', { name: labelForSelected(totals, selectedKey, t) })}
<button
type="button"
onClick={() => setSelectedKey(null)}
aria-label={t('usage.sources.clearFilter')}
style={{
appearance: 'none',
background: 'transparent',
border: 'none',
color: 'var(--color-text-muted)',
cursor: 'pointer',
padding: 0,
fontSize: '0.875rem',
lineHeight: 1,
}}
>
<i className="fas fa-xmark" />
</button>
</span>
</div>
)}
<div className="card" style={{ padding: 'var(--spacing-md)' }}>
<SourceTimeChart buckets={buckets} selectedKey={selectedKey} totals={totals} />
</div>
<div className="card" style={{ padding: 'var(--spacing-md)' }}>
<SourcesTable
totals={totals}
selectedKey={selectedKey}
onSelectKey={setSelectedKey}
search={search}
setSearch={setSearch}
sortKey={sortKey}
setSortKey={setSortKey}
existingKeyIds={existingKeyIds}
showUserColumn={isAdmin}
/>
</div>
{truncated && (
<div style={{ fontSize: '0.75rem', color: 'var(--color-warning)' }}>
{t('usage.sources.truncatedWarning')}
</div>
)}
</div>
)
}

@@ -0,0 +1,245 @@
import { useMemo } from 'react'
import { useTranslation } from 'react-i18next'
const SORT_FNS = {
tokens: (a, b) => (b.tokens || 0) - (a.tokens || 0),
requests: (a, b) => (b.requests || 0) - (a.requests || 0),
last_used: (a, b) => new Date(b.last_used || 0).getTime() - new Date(a.last_used || 0).getTime(),
name: (a, b) => (a.name || '').localeCompare(b.name || ''),
user: (a, b) => (a.userName || '').localeCompare(b.userName || ''),
}
function formatTokens(n) {
if (!n) return '0'
if (n >= 1_000_000) return (n / 1_000_000).toFixed(1) + 'M'
if (n >= 1_000) return (n / 1_000).toFixed(1) + 'k'
return String(n)
}
function formatRelative(iso) {
if (!iso) return '-'
const t = new Date(iso).getTime()
if (Number.isNaN(t) || t <= 0) return '-'
const diff = Date.now() - t
if (diff < 60_000) return 'just now'
if (diff < 3_600_000) return Math.round(diff / 60_000) + 'm ago'
if (diff < 86_400_000) return Math.round(diff / 3_600_000) + 'h ago'
return Math.round(diff / 86_400_000) + 'd ago'
}
// SourcesTable is the searchable, sortable list of key totals plus pseudo-rows
// for the web UI and legacy (unkeyed) source classes. Clicking a row selects
// it (clicking the selected row again clears the selection); the parent decides
// what to do with the selection.
//
// Props:
// totals: SourceTotals payload (from /api/auth/usage/sources)
// selectedKey: currently-selected row id (api_key_id | 'web' | 'legacy' | null)
// onSelectKey: (id|null) => void
// search / setSearch: free-text filter state lifted to the parent
// sortKey / setSortKey: sort column state lifted to the parent
// existingKeyIds: Set<string> of current (non-revoked) api key ids, or null
// when the parent hasn't yet learned which keys exist. Null suppresses the
// revoked badge entirely so live keys aren't dimmed during the fetch or
// after a failure.
// showUserColumn: render the User column. Admin views set this true so the
// reader can attribute each key (and each Web UI row) to its owner.
export default function SourcesTable({
totals,
selectedKey,
onSelectKey,
search,
setSearch,
sortKey,
setSortKey,
existingKeyIds = null,
showUserColumn = false,
}) {
const { t } = useTranslation('admin')
const rows = useMemo(() => {
const named = (totals?.by_key || []).map((k) => ({
kind: 'apikey',
id: k.api_key_id,
name: k.api_key_name || k.api_key_id,
userID: k.user_id || '',
userName: k.user_name || '',
prefix: '',
tokens: k.tokens,
requests: k.requests,
last_used: k.last_used,
revoked: existingKeyIds != null && !existingKeyIds.has(k.api_key_id),
}))
// Pseudo-rows for sources that don't have a named key identity.
// In admin view (showUserColumn=true), prefer the per-user breakdown
// from totals.by_user_source so each user's Web UI / legacy traffic
// gets its own row. Otherwise fall back to the global by_source aggregate.
let unkeyed = []
if (showUserColumn && Array.isArray(totals?.by_user_source) && totals.by_user_source.length > 0) {
unkeyed = totals.by_user_source.map((r) => ({
kind: r.source,
id: r.source + ':' + (r.user_id || ''),
name: r.source === 'legacy' ? t('usage.sources.legacy') : t('usage.sources.webUI'),
userID: r.user_id || '',
userName: r.user_name || '',
prefix: '-',
tokens: r.tokens,
requests: r.requests,
}))
} else {
if (totals?.by_source?.web) {
unkeyed.push({
kind: 'web',
id: 'web',
name: t('usage.sources.webUI'),
userID: '',
userName: '',
prefix: '-',
tokens: totals.by_source.web.tokens,
requests: totals.by_source.web.requests,
})
}
if (totals?.by_source?.legacy) {
unkeyed.push({
kind: 'legacy',
id: 'legacy',
name: t('usage.sources.legacy'),
userID: '',
userName: '',
prefix: '-',
tokens: totals.by_source.legacy.tokens,
requests: totals.by_source.legacy.requests,
})
}
}
return [...named, ...unkeyed]
}, [totals, existingKeyIds, showUserColumn, t])
const filtered = useMemo(() => {
const q = (search || '').trim().toLowerCase()
const list = q
? rows.filter((r) =>
(r.name || '').toLowerCase().includes(q) ||
(r.prefix || '').toLowerCase().includes(q) ||
(r.userName || '').toLowerCase().includes(q) ||
(r.userID || '').toLowerCase().includes(q)
)
: rows
return [...list].sort(SORT_FNS[sortKey] || SORT_FNS.tokens)
}, [rows, search, sortKey])
const iconFor = (kind) =>
kind === 'apikey' ? 'fas fa-key' : kind === 'web' ? 'fas fa-globe' : 'fas fa-gear'
return (
<div style={{ display: 'flex', flexDirection: 'column', gap: 'var(--spacing-sm)' }}>
<div style={{ display: 'flex', alignItems: 'center', gap: 'var(--spacing-sm)', flexWrap: 'wrap' }}>
<input
type="search"
value={search}
onChange={(e) => setSearch(e.target.value)}
placeholder={t('usage.sources.searchPlaceholder')}
aria-label={t('usage.sources.searchPlaceholder')}
style={{
flex: '1 1 12rem',
minWidth: 160,
padding: 'var(--spacing-xs) var(--spacing-sm)',
border: '1px solid var(--color-border-subtle)',
borderRadius: 'var(--radius-sm)',
background: 'var(--color-bg-primary)',
color: 'var(--color-text-primary)',
}}
/>
<label style={{ display: 'inline-flex', alignItems: 'center', gap: 6, fontSize: '0.75rem' }}>
{t('usage.sources.sortBy')}:
<select
value={sortKey}
onChange={(e) => setSortKey(e.target.value)}
style={{
padding: 'calc(var(--spacing-xs) / 2) var(--spacing-xs)',
border: '1px solid var(--color-border-subtle)',
borderRadius: 'var(--radius-sm)',
background: 'var(--color-bg-primary)',
color: 'var(--color-text-primary)',
}}
>
<option value="tokens">{t('usage.sources.sortTokens')}</option>
<option value="requests">{t('usage.sources.sortRequests')}</option>
<option value="last_used">{t('usage.sources.sortLastUsed')}</option>
<option value="name">{t('usage.sources.sortName')}</option>
{showUserColumn && <option value="user">{t('usage.sources.sortUser')}</option>}
</select>
</label>
</div>
<div className="table-container">
<table className="table">
<thead>
<tr>
<th>{t('usage.sources.sortName')}</th>
{showUserColumn && <th style={{ width: 180 }}>{t('usage.sources.sortUser')}</th>}
<th style={{ width: 110 }}>Prefix</th>
<th style={{ width: 100, textAlign: 'right' }}>{t('usage.sources.sortRequests')}</th>
<th style={{ width: 100, textAlign: 'right' }}>{t('usage.sources.sortTokens')}</th>
<th style={{ width: 120, textAlign: 'right' }}>{t('usage.sources.sortLastUsed')}</th>
</tr>
</thead>
<tbody>
{filtered.map((r) => {
const isSel = selectedKey === r.id
return (
<tr
key={r.id}
onClick={() => onSelectKey?.(isSel ? null : r.id)}
style={{
cursor: 'pointer',
background: isSel ? 'var(--color-bg-secondary)' : undefined,
opacity: r.revoked ? 0.5 : 1,
}}
>
<td>
<span style={{ display: 'inline-flex', alignItems: 'center', gap: 8 }}>
<i
className={iconFor(r.kind)}
style={{ color: 'var(--color-text-muted)', fontSize: '0.8125rem' }}
/>
<span>{r.name}</span>
{r.revoked && (
<span
style={{
fontSize: '0.6875rem',
textTransform: 'uppercase',
color: 'var(--color-text-muted)',
}}
>
({t('usage.sources.revoked')})
</span>
)}
</span>
</td>
{showUserColumn && (
<td style={{ color: 'var(--color-text-secondary)', fontSize: '0.8125rem' }}>
{r.userName || r.userID || '-'}
</td>
)}
<td style={{ color: 'var(--color-text-muted)', fontSize: '0.75rem' }}>{r.prefix || '-'}</td>
<td style={{ textAlign: 'right', fontFamily: 'var(--font-mono)' }}>
{Number(r.requests || 0).toLocaleString()}
</td>
<td style={{ textAlign: 'right', fontFamily: 'var(--font-mono)' }}>
{formatTokens(r.tokens || 0)}
</td>
<td style={{ textAlign: 'right', fontSize: '0.75rem', color: 'var(--color-text-muted)' }}>
{formatRelative(r.last_used)}
</td>
</tr>
)
})}
</tbody>
</table>
</div>
</div>
)
}

@@ -422,6 +422,14 @@ export const usageApi = {
if (userId) url += `&user_id=${encodeURIComponent(userId)}`
return fetchJSON(url)
},
getMySources: (period) =>
fetchJSON(`/api/auth/usage/sources?period=${period || 'month'}`),
getAdminSources: (period, userId, apiKeyId) => {
let url = `/api/auth/admin/usage/sources?period=${period || 'month'}`
if (userId) url += `&user_id=${encodeURIComponent(userId)}`
if (apiKeyId) url += `&api_key_id=${encodeURIComponent(apiKeyId)}`
return fetchJSON(url)
},
getMyQuotas: () => fetchJSON('/api/auth/quota'),
}

@@ -789,6 +789,30 @@ func RegisterAuthRoutes(e *echo.Echo, app *application.Application) {
})
})
// GET /api/auth/usage/sources - caller's per-source breakdown (no legacy)
e.GET("/api/auth/usage/sources", func(c echo.Context) error {
user := auth.GetUser(c)
if user == nil {
return c.JSON(http.StatusUnauthorized, map[string]string{"error": "not authenticated"})
}
period := c.QueryParam("period")
if period == "" {
period = "month"
}
buckets, totals, err := auth.GetUserUsageBySource(db, user.ID, period)
if err != nil {
return c.JSON(http.StatusInternalServerError, map[string]string{"error": "failed to get usage"})
}
return c.JSON(http.StatusOK, map[string]any{
"buckets": buckets,
"totals": totals,
"truncated": false,
})
})
// Admin endpoints
adminMw := auth.RequireAdmin()
@@ -1104,6 +1128,27 @@ func RegisterAuthRoutes(e *echo.Echo, app *application.Application) {
})
}, adminMw)
// GET /api/auth/admin/usage/sources - all users' per-source breakdown (admin only)
e.GET("/api/auth/admin/usage/sources", func(c echo.Context) error {
period := c.QueryParam("period")
if period == "" {
period = "month"
}
userID := c.QueryParam("user_id")
apiKeyID := c.QueryParam("api_key_id")
buckets, totals, truncated, err := auth.GetAllUsageBySource(db, period, userID, apiKeyID)
if err != nil {
return c.JSON(http.StatusInternalServerError, map[string]string{"error": "failed to get usage"})
}
return c.JSON(http.StatusOK, map[string]any{
"buckets": buckets,
"totals": totals,
"truncated": truncated,
})
}, adminMw)
// --- Invite management endpoints ---
// POST /api/auth/admin/invites - create invite (admin only)

@@ -286,6 +286,45 @@ func newTestAuthApp(db *gorm.DB, appConfig *config.ApplicationConfig) *echo.Echo
return c.JSON(http.StatusOK, map[string]string{"message": "user deleted"})
}, adminMw)
// Mirror of production handler in routes/auth.go GET /api/auth/usage/sources.
// Keep this body in sync with the real handler; this test app cannot call
// RegisterAuthRoutes because it needs a *application.Application.
e.GET("/api/auth/usage/sources", func(c echo.Context) error {
user := auth.GetUser(c)
if user == nil {
return c.JSON(http.StatusUnauthorized, map[string]string{"error": "not authenticated"})
}
period := c.QueryParam("period")
if period == "" {
period = "month"
}
buckets, totals, err := auth.GetUserUsageBySource(db, user.ID, period)
if err != nil {
return c.JSON(http.StatusInternalServerError, map[string]string{"error": "failed to get usage"})
}
return c.JSON(http.StatusOK, map[string]any{
"buckets": buckets, "totals": totals, "truncated": false,
})
})
// Mirror of production handler in routes/auth.go GET /api/auth/admin/usage/sources.
// Keep this body in sync with the real handler.
e.GET("/api/auth/admin/usage/sources", func(c echo.Context) error {
period := c.QueryParam("period")
if period == "" {
period = "month"
}
userID := c.QueryParam("user_id")
apiKeyID := c.QueryParam("api_key_id")
buckets, totals, truncated, err := auth.GetAllUsageBySource(db, period, userID, apiKeyID)
if err != nil {
return c.JSON(http.StatusInternalServerError, map[string]string{"error": "failed to get usage"})
}
return c.JSON(http.StatusOK, map[string]any{
"buckets": buckets, "totals": totals, "truncated": truncated,
})
}, adminMw)
// Regular API endpoint for testing
e.POST("/v1/chat/completions", func(c echo.Context) error {
return c.String(http.StatusOK, "ok")
@@ -931,4 +970,110 @@ var _ = Describe("Auth Routes", Label("auth"), func() {
Expect(providers).To(ContainElement(auth.ProviderGitHub))
})
})
Describe("GET /api/auth/usage/sources", func() {
It("returns only the caller's data, never legacy", func() {
app := newTestAuthApp(db, appConfig)
alice := createRouteTestUser(db, "alice@example.com", auth.RoleUser)
aliceToken, err := auth.CreateSession(db, alice.ID, "")
Expect(err).ToNot(HaveOccurred())
keyID := "k-alice"
now := time.Now()
Expect(auth.RecordUsage(db, &auth.UsageRecord{
UserID: alice.ID, Source: auth.UsageSourceAPIKey,
APIKeyID: &keyID, APIKeyName: "alice-key",
Model: "gpt-4", TotalTokens: 100, CreatedAt: now,
})).To(Succeed())
Expect(auth.RecordUsage(db, &auth.UsageRecord{
UserID: alice.ID, Source: auth.UsageSourceWeb,
Model: "gpt-4", TotalTokens: 50, CreatedAt: now,
})).To(Succeed())
Expect(auth.RecordUsage(db, &auth.UsageRecord{
UserID: "legacy-api-key", Source: auth.UsageSourceLegacy,
Model: "gpt-4", TotalTokens: 30, CreatedAt: now,
})).To(Succeed())
rec := doAuthRequest(app, http.MethodGet, "/api/auth/usage/sources?period=month", nil, withSession(aliceToken))
Expect(rec.Code).To(Equal(http.StatusOK))
var resp struct {
Buckets []auth.UsageBucket `json:"buckets"`
Totals auth.SourceTotals `json:"totals"`
Truncated bool `json:"truncated"`
}
Expect(json.Unmarshal(rec.Body.Bytes(), &resp)).To(Succeed())
_, hasLegacy := resp.Totals.BySource[auth.UsageSourceLegacy]
Expect(hasLegacy).To(BeFalse())
Expect(resp.Totals.GrandTotal.Tokens).To(Equal(int64(150)))
Expect(resp.Truncated).To(BeFalse())
})
It("returns 401 when unauthenticated", func() {
app := newTestAuthApp(db, appConfig)
// Without a session cookie or bearer token, the global auth middleware
// should refuse the request before our handler runs.
rec := doAuthRequest(app, http.MethodGet, "/api/auth/usage/sources?period=month", nil)
Expect(rec.Code).To(Equal(http.StatusUnauthorized))
})
})
Describe("GET /api/auth/admin/usage/sources", func() {
It("returns 403 for non-admin", func() {
app := newTestAuthApp(db, appConfig)
alice := createRouteTestUser(db, "alice@example.com", auth.RoleUser)
aliceToken, _ := auth.CreateSession(db, alice.ID, "")
rec := doAuthRequest(app, http.MethodGet, "/api/auth/admin/usage/sources?period=month", nil, withSession(aliceToken))
Expect(rec.Code).To(Equal(http.StatusForbidden))
})
It("returns legacy bucket for admin and applies api_key_id filter", func() {
app := newTestAuthApp(db, appConfig)
admin := createRouteTestUser(db, "admin@example.com", auth.RoleAdmin)
adminToken, _ := auth.CreateSession(db, admin.ID, "")
k1 := "k1"
k2 := "k2"
now := time.Now()
Expect(auth.RecordUsage(db, &auth.UsageRecord{UserID: "alice", Source: auth.UsageSourceAPIKey, APIKeyID: &k1, APIKeyName: "ci", Model: "gpt-4", TotalTokens: 10, CreatedAt: now})).To(Succeed())
Expect(auth.RecordUsage(db, &auth.UsageRecord{UserID: "alice", Source: auth.UsageSourceAPIKey, APIKeyID: &k2, APIKeyName: "lap", Model: "gpt-4", TotalTokens: 20, CreatedAt: now})).To(Succeed())
Expect(auth.RecordUsage(db, &auth.UsageRecord{UserID: "legacy-api-key", Source: auth.UsageSourceLegacy, Model: "gpt-4", TotalTokens: 5, CreatedAt: now})).To(Succeed())
rec := doAuthRequest(app, http.MethodGet,
"/api/auth/admin/usage/sources?period=month&api_key_id=k2", nil, withSession(adminToken))
Expect(rec.Code).To(Equal(http.StatusOK))
var resp struct {
Totals auth.SourceTotals `json:"totals"`
Truncated bool `json:"truncated"`
}
Expect(json.Unmarshal(rec.Body.Bytes(), &resp)).To(Succeed())
Expect(resp.Totals.GrandTotal.Tokens).To(Equal(int64(20)))
})
It("includes legacy in by_source for admin with no filter", func() {
app := newTestAuthApp(db, appConfig)
admin := createRouteTestUser(db, "admin@example.com", auth.RoleAdmin)
adminToken, _ := auth.CreateSession(db, admin.ID, "")
now := time.Now()
Expect(auth.RecordUsage(db, &auth.UsageRecord{UserID: "legacy-api-key", Source: auth.UsageSourceLegacy, Model: "gpt-4", TotalTokens: 7, CreatedAt: now})).To(Succeed())
rec := doAuthRequest(app, http.MethodGet, "/api/auth/admin/usage/sources?period=month", nil, withSession(adminToken))
Expect(rec.Code).To(Equal(http.StatusOK))
var resp struct {
Totals auth.SourceTotals `json:"totals"`
}
Expect(json.Unmarshal(rec.Body.Bytes(), &resp)).To(Succeed())
Expect(resp.Totals.BySource).To(HaveKey(auth.UsageSourceLegacy))
Expect(resp.Totals.BySource[auth.UsageSourceLegacy].Tokens).To(Equal(int64(7)))
})
})
})
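
The specs above decode the `{"buckets", "totals", "truncated"}` envelope and check whether a legacy entry is present. A minimal sketch of that decoding step follows. The type and function names (`usageCount`, `sourcesResponse`, `decodeSources`, `hasSource`) are hypothetical, and the JSON field names (`by_source`, `grand_total`, `tokens`, `requests`) are inferred from what the React code in this diff reads, not taken from the real `auth.SourceTotals` definition.

```go
package main

import "encoding/json"

// usageCount is a hypothetical stand-in for one token/request aggregate.
type usageCount struct {
	Tokens   int64 `json:"tokens"`
	Requests int64 `json:"requests"`
}

// sourcesResponse mirrors only the parts of the envelope used below.
// Field names are assumptions inferred from the frontend code.
type sourcesResponse struct {
	Totals struct {
		BySource   map[string]usageCount `json:"by_source"`
		GrandTotal usageCount            `json:"grand_total"`
	} `json:"totals"`
	Truncated bool `json:"truncated"`
}

// decodeSources parses a response body into the sketch envelope.
func decodeSources(body []byte) (sourcesResponse, error) {
	var resp sourcesResponse
	err := json.Unmarshal(body, &resp)
	return resp, err
}

// hasSource reports whether the response carries an entry for the given
// source class (for example "web" or "legacy").
func hasSource(resp sourcesResponse, source string) bool {
	_, ok := resp.Totals.BySource[source]
	return ok
}
```

Checking for key presence, rather than for a zero token count, is what lets a caller distinguish "no legacy bucket was returned" from "a legacy bucket with no traffic".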

@@ -6,7 +6,9 @@ import (
"strings"
"github.com/labstack/echo/v4"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/http/endpoints/localai"
"github.com/mudler/LocalAI/core/services/galleryop"
"github.com/mudler/LocalAI/core/services/nodes"
"gorm.io/gorm"
)
@@ -53,7 +55,12 @@ func RegisterNodeSelfServiceRoutes(e *echo.Echo, registry *nodes.NodeRegistry, r
// RegisterNodeAdminRoutes registers /api/nodes/ endpoints used by admins
// (list, get, get models, drain, delete, approve, backend management). Protected by admin middleware.
func RegisterNodeAdminRoutes(e *echo.Echo, registry *nodes.NodeRegistry, unloader nodes.NodeCommandSender, adminMw echo.MiddlewareFunc, authDB *gorm.DB, hmacSecret string, registrationToken string) {
//
// galleryService/opcache/appConfig are threaded in for the async node-scoped
// backend install path (POST /:id/backends/install). That handler enqueues a
// ManagementOp on the gallery channel rather than blocking on a NATS reply, so
// the browser gets HTTP 202 + jobID immediately instead of waiting up to 3 minutes.
func RegisterNodeAdminRoutes(e *echo.Echo, registry *nodes.NodeRegistry, unloader nodes.NodeCommandSender, galleryService *galleryop.GalleryService, opcache *galleryop.OpCache, appConfig *config.ApplicationConfig, adminMw echo.MiddlewareFunc, authDB *gorm.DB, hmacSecret string, registrationToken string) {
if registry == nil {
return
}
@@ -78,7 +85,7 @@ func RegisterNodeAdminRoutes(e *echo.Echo, registry *nodes.NodeRegistry, unloade
// Backend management on workers
admin.GET("/:id/backends", localai.ListBackendsOnNodeEndpoint(unloader))
admin.POST("/:id/backends/install", localai.InstallBackendOnNodeEndpoint(unloader))
admin.POST("/:id/backends/install", localai.InstallBackendOnNodeEndpoint(unloader, galleryService, opcache, appConfig))
admin.POST("/:id/backends/delete", localai.DeleteBackendOnNodeEndpoint(unloader))
// Model management on workers

@@ -10,6 +10,7 @@ import (
"net/http"
"net/url"
"slices"
"sort"
"strconv"
"strings"
"time"
@@ -57,7 +58,6 @@ var usecaseFilters = map[string]config.ModelConfigUsecase{
config.UsecaseRealtimeAudio: config.FLAG_REALTIME_AUDIO,
}
// extractHFRepo tries to find a HuggingFace repo ID from model overrides or URLs.
func extractHFRepo(overrides map[string]any, urls []string) string {
if overrides != nil {
@@ -214,6 +214,17 @@ func RegisterUIAPIRoutes(app *echo.Echo, cl *config.ModelConfigLoader, ml *model
}
}
// Node-scoped backend ops (from /api/nodes/:id/backends/install)
// carry the nodeID inside the opcache key as "node:<nodeID>:<backend>".
// Pull it back out so the operations panel can label which node the
// install is targeting, and so the display name is just the backend
// slug instead of the full prefixed key.
scopedNodeID := ""
if nodeID, backend, ok := galleryop.ParseNodeScopedKey(galleryID); ok {
scopedNodeID = nodeID
galleryID = backend
}
// Extract display name (remove repo prefix if exists)
displayName := galleryID
if strings.Contains(galleryID, "@") {
@@ -237,9 +248,53 @@ func RegisterUIAPIRoutes(app *echo.Echo, cl *config.ModelConfigLoader, ml *model
"cancellable": isCancellable,
"message": message,
}
// Only attach nodeID when this op was node-scoped: an empty string
// would mislead the UI into rendering a node attribution that never
// existed in the first place.
if scopedNodeID != "" {
opData["nodeID"] = scopedNodeID
}
if status != nil && status.Error != nil {
opData["error"] = status.Error.Error()
}
// Expose the per-node breakdown when the Phase 4 progress sink
// has populated OpStatus.Nodes (distributed backend installs).
// We sort by node_name for stable UI rendering across polls;
// the underlying slice is order-dependent on UpdateNodeProgress
// arrival order, which the UI must not depend on. Single-node
// ops and model installs leave Nodes empty so this block emits
// no key, preserving the legacy payload shape.
if status != nil && len(status.Nodes) > 0 {
nodes := make([]map[string]any, 0, len(status.Nodes))
for _, n := range status.Nodes {
entry := map[string]any{
"node_id": n.NodeID,
"node_name": n.NodeName,
"status": n.Status,
"percentage": n.Percentage,
}
if n.FileName != "" {
entry["file_name"] = n.FileName
}
if n.Current != "" {
entry["current"] = n.Current
}
if n.Total != "" {
entry["total"] = n.Total
}
if n.Phase != "" {
entry["phase"] = n.Phase
}
if n.Error != "" {
entry["error"] = n.Error
}
nodes = append(nodes, entry)
}
sort.SliceStable(nodes, func(i, j int) bool {
return fmt.Sprintf("%v", nodes[i]["node_name"]) < fmt.Sprintf("%v", nodes[j]["node_name"])
})
opData["nodes"] = nodes
}
operations = append(operations, opData)
}
@@ -540,11 +595,11 @@ func RegisterUIAPIRoutes(app *echo.Echo, cl *config.ModelConfigLoader, ml *model
NodeStatus string `json:"node_status"`
}
type modelCapability struct {
ID string `json:"id"`
Capabilities []string `json:"capabilities"`
Backend string `json:"backend"`
Disabled bool `json:"disabled"`
Pinned bool `json:"pinned"`
// LoadedOn is populated only when the node registry is active
// (distributed mode). Lets the UI show "loaded on worker-1" without
// the operator having to expand every node manually. An empty slice
@@ -1142,17 +1197,17 @@ func RegisterUIAPIRoutes(app *echo.Echo, cl *config.ModelConfigLoader, ml *model
}
return c.JSON(200, map[string]any{
"backends": backendsJSON,
"repositories": appConfig.BackendGalleries,
"allTags": tags,
"processingBackends": processingBackendsData,
"taskTypes": taskTypes,
"availableBackends": totalBackends,
"installedBackends": installedBackendsCount,
"currentPage": pageNum,
"totalPages": totalPages,
"prevPage": prevPage,
"nextPage": nextPage,
"systemCapability": detectedCapability,
"preferDevelopmentBackends": appConfig.PreferDevelopmentBackends,
})
@@ -1582,4 +1637,3 @@ func RegisterUIAPIRoutes(app *echo.Echo, cl *config.ModelConfigLoader, ml *model
app.DELETE("/api/branding/asset/:kind", localai.DeleteBrandingAssetEndpoint(appConfig), adminMiddleware)
}
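
The per-node block in the operations handler sorts entries by `node_name` so the UI sees a stable order regardless of when each worker reported progress. The same idea can be sketched on a typed slice, which avoids formatting map values with `fmt.Sprintf` inside the comparator. The `nodeEntry` type and `sortByNodeName` helper are hypothetical and exist only for this illustration; they are not part of the handler.

```go
package main

import "sort"

// nodeEntry is a hypothetical, typed stand-in for the per-node map entries
// built by the handler; the field set is an assumption for illustration.
type nodeEntry struct {
	NodeID   string
	NodeName string
	Status   string
}

// sortByNodeName returns a copy of entries ordered by NodeName. The sort is
// stable, so entries that share a name keep their arrival order, and the
// caller's slice is left untouched.
func sortByNodeName(entries []nodeEntry) []nodeEntry {
	out := make([]nodeEntry, len(entries))
	copy(out, entries)
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].NodeName < out[j].NodeName
	})
	return out
}
```

Sorting a copy keeps the arrival-ordered slice intact for any other reader, at the cost of one allocation per poll.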

@@ -0,0 +1,155 @@
package routes_test
import (
"encoding/json"
"net/http"
"net/http/httptest"
"github.com/labstack/echo/v4"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"github.com/mudler/LocalAI/core/application"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/http/routes"
"github.com/mudler/LocalAI/core/services/galleryop"
)
// These specs guard the contract between the opcache (which stores
// node-scoped backend installs under a "node:<nodeID>:<backend>" key) and the
// /api/operations response surface the React UI polls. Without nodeID
// extraction the panel would show the raw prefixed key and have no way to
// label which worker an install is targeting.
var _ = Describe("/api/operations with node-scoped backend ops", func() {
// We pass a zero-value *application.Application because the handler's
// distributed-services branch guards on a nil check on the returned
// *DistributedServices, which is nil for a fresh Application{}.
noopMw := func(next echo.HandlerFunc) echo.HandlerFunc { return next }
It("emits nodeID and the un-prefixed backend name for keys built by NodeScopedKey", func() {
appCfg := &config.ApplicationConfig{}
galleryService := galleryop.NewGalleryService(appCfg, nil)
opcache := galleryop.NewOpCache(galleryService)
key := galleryop.NodeScopedKey("worker-7", "llama-cpp")
opcache.SetBackend(key, "job-uuid-123")
e := echo.New()
routes.RegisterUIAPIRoutes(e, nil, nil, appCfg, galleryService, opcache, &application.Application{}, noopMw)
req := httptest.NewRequest(http.MethodGet, "/api/operations", nil)
rec := httptest.NewRecorder()
e.ServeHTTP(rec, req)
Expect(rec.Code).To(Equal(http.StatusOK))
// The handler wraps operations in {"operations": [...]}.
var envelope struct {
Operations []map[string]any `json:"operations"`
}
Expect(json.Unmarshal(rec.Body.Bytes(), &envelope)).To(Succeed())
var found map[string]any
for _, op := range envelope.Operations {
if op["jobID"] == "job-uuid-123" {
found = op
break
}
}
Expect(found).ToNot(BeNil(), "node-scoped op should appear in /api/operations")
Expect(found["nodeID"]).To(Equal("worker-7"))
Expect(found["name"]).To(Equal("llama-cpp"))
Expect(found["isBackend"]).To(Equal(true))
})
It("surfaces per-node OpStatus entries on /api/operations", func() {
appCfg := &config.ApplicationConfig{}
galleryService := galleryop.NewGalleryService(appCfg, nil)
opcache := galleryop.NewOpCache(galleryService)
jobID := "test-op-nodes-1"
// Register a backend op so the handler treats this as a backend
// install (no need to consult the gallery during the test).
opcache.SetBackend("vllm", jobID)
// Populate per-node entries via the P4.2 helper. The helper also
// allocates an OpStatus under jobID, which the handler will read.
galleryService.UpdateNodeProgress(jobID, "node-b", galleryop.NodeProgress{
NodeID: "node-b", NodeName: "worker-b", Status: galleryop.NodeStatusRunningOnWorker,
})
galleryService.UpdateNodeProgress(jobID, "node-a", galleryop.NodeProgress{
NodeID: "node-a", NodeName: "worker-a", Status: galleryop.NodeStatusDownloading, Percentage: 30, FileName: "vllm.tar",
})
e := echo.New()
routes.RegisterUIAPIRoutes(e, nil, nil, appCfg, galleryService, opcache, &application.Application{}, noopMw)
req := httptest.NewRequest(http.MethodGet, "/api/operations", nil)
rec := httptest.NewRecorder()
e.ServeHTTP(rec, req)
Expect(rec.Code).To(Equal(http.StatusOK))
var envelope struct {
Operations []map[string]any `json:"operations"`
}
Expect(json.Unmarshal(rec.Body.Bytes(), &envelope)).To(Succeed())
var found map[string]any
for _, op := range envelope.Operations {
if op["jobID"] == jobID {
found = op
break
}
}
Expect(found).ToNot(BeNil(), "operation should appear in /api/operations")
nodes, ok := found["nodes"].([]any)
Expect(ok).To(BeTrue(), "operation should have a nodes array")
Expect(nodes).To(HaveLen(2))
// Stable sort by node_name: "worker-a" comes before "worker-b"
// even though UpdateNodeProgress was called in reverse order.
first := nodes[0].(map[string]any)
Expect(first["node_name"]).To(Equal("worker-a"))
Expect(first["status"]).To(Equal("downloading"))
Expect(first["file_name"]).To(Equal("vllm.tar"))
Expect(first["percentage"]).To(Equal(30.0))
second := nodes[1].(map[string]any)
Expect(second["node_name"]).To(Equal("worker-b"))
Expect(second["status"]).To(Equal("running_on_worker"))
})
It("does not emit nodeID for non-node-scoped backend ops", func() {
appCfg := &config.ApplicationConfig{}
galleryService := galleryop.NewGalleryService(appCfg, nil)
opcache := galleryop.NewOpCache(galleryService)
// Legacy/global install path: bare backend name as the opcache key.
opcache.SetBackend("llama-cpp", "job-uuid-456")
e := echo.New()
routes.RegisterUIAPIRoutes(e, nil, nil, appCfg, galleryService, opcache, &application.Application{}, noopMw)
req := httptest.NewRequest(http.MethodGet, "/api/operations", nil)
rec := httptest.NewRecorder()
e.ServeHTTP(rec, req)
Expect(rec.Code).To(Equal(http.StatusOK))
var envelope struct {
Operations []map[string]any `json:"operations"`
}
Expect(json.Unmarshal(rec.Body.Bytes(), &envelope)).To(Succeed())
var found map[string]any
for _, op := range envelope.Operations {
if op["jobID"] == "job-uuid-456" {
found = op
break
}
}
Expect(found).ToNot(BeNil())
// Critical: bare ops must NOT gain a misleading empty nodeID field.
Expect(found).ToNot(HaveKey("nodeID"), "non-node-scoped ops must NOT carry a nodeID field")
Expect(found["name"]).To(Equal("llama-cpp"))
})
})
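
The specs above depend on the `node:<nodeID>:<backend>` key format. A minimal sketch of how such a key could be built and parsed is shown below. The lowercase `nodeScopedKey` and `parseNodeScopedKey` are hypothetical re-implementations written for illustration, not the actual `galleryop` code. They assume node IDs never contain a colon, and that empty node IDs or backend names should be rejected (the second point is an assumption of this sketch).

```go
package main

import "strings"

// nodeScopedPrefix marks an opcache key as node-scoped (format taken from
// the comments in this diff).
const nodeScopedPrefix = "node:"

// nodeScopedKey builds "node:<nodeID>:<backend>".
func nodeScopedKey(nodeID, backend string) string {
	return nodeScopedPrefix + nodeID + ":" + backend
}

// parseNodeScopedKey splits a node-scoped key back into its parts. It cuts
// on the first colon after the prefix, so a backend name that itself
// contains colons is returned whole. Keys without the prefix report ok=false.
func parseNodeScopedKey(key string) (nodeID, backend string, ok bool) {
	if !strings.HasPrefix(key, nodeScopedPrefix) {
		return "", "", false
	}
	rest := strings.TrimPrefix(key, nodeScopedPrefix)
	n, b, found := strings.Cut(rest, ":")
	if !found || n == "" || b == "" {
		return "", "", false
	}
	return n, b, true
}
```

Under these assumptions a bare key such as `llama-cpp` does not parse, which matches the spec that non-node-scoped operations must not gain a `nodeID` field.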

@@ -91,6 +91,21 @@ func (g *GalleryService) backendHandler(op *ManagementOp[gallery.GalleryBackend,
})
return err
}
if errors.Is(err, ErrWorkerStillInstalling) {
// Soft failure: at least one worker timed out replying but is
// still running the install in the background. Mark the op as
// processed with a non-error message so the admin UI shows a
// yellow in-progress state rather than red. The reconciler's
// next pass will confirm the actual outcome via backend.list.
xlog.Info("worker still installing in background", "backend", op.GalleryElementName, "error", err)
g.UpdateStatus(op.ID, &OpStatus{
Processed: true,
GalleryElementName: op.GalleryElementName,
Message: fmt.Sprintf("backend %s: worker still installing in background; reconciler will confirm completion (%v)", op.GalleryElementName, err),
Cancellable: false,
})
return nil
}
xlog.Error("error installing backend", "error", err, "backend", op.GalleryElementName)
if !op.Delete {
// If we didn't install the backend, we need to make sure we don't have a leftover directory

@@ -196,4 +196,60 @@ var _ = Describe("ManagementOp with External Backend", func() {
Expect(op.ExternalName).To(Equal("test-backend"))
Expect(op.ExternalAlias).To(Equal("test-alias"))
})
Context("TargetNodeID field", func() {
It("defaults to empty string", func() {
op := galleryop.ManagementOp[string, string]{
ExternalURI: "oci://example.com/backend:latest",
}
Expect(op.TargetNodeID).To(BeEmpty())
})
It("preserves TargetNodeID across a channel send", func() {
ch := make(chan galleryop.ManagementOp[string, string], 1)
ch <- galleryop.ManagementOp[string, string]{
GalleryElementName: "llama-cpp",
TargetNodeID: "node-abc-123",
}
received := <-ch
Expect(received.TargetNodeID).To(Equal("node-abc-123"))
Expect(received.GalleryElementName).To(Equal("llama-cpp"))
})
})
Describe("NodeScopedKey", func() {
It("builds a unique key per (nodeID, backend) pair", func() {
Expect(galleryop.NodeScopedKey("node-a", "llama-cpp")).To(Equal("node:node-a:llama-cpp"))
Expect(galleryop.NodeScopedKey("node-b", "llama-cpp")).To(Equal("node:node-b:llama-cpp"))
Expect(galleryop.NodeScopedKey("node-a", "vllm")).To(Equal("node:node-a:vllm"))
})
It("handles backend names containing colons", func() {
// Gallery IDs sometimes look like "official@llama-cpp"; nodeIDs are UUIDs
// without colons, but the backend slug may contain anything. Splitting on
// the first colon after the prefix MUST yield the full backend back.
key := galleryop.NodeScopedKey("node-1", "official@llama-cpp:v2")
node, backend, ok := galleryop.ParseNodeScopedKey(key)
Expect(ok).To(BeTrue())
Expect(node).To(Equal("node-1"))
Expect(backend).To(Equal("official@llama-cpp:v2"))
})
It("rejects keys without the node prefix", func() {
_, _, ok := galleryop.ParseNodeScopedKey("llama-cpp")
Expect(ok).To(BeFalse())
_, _, ok = galleryop.ParseNodeScopedKey("official@llama-cpp")
Expect(ok).To(BeFalse())
})
It("rejects malformed node-prefixed keys", func() {
_, _, ok := galleryop.ParseNodeScopedKey("node:only-one-segment")
Expect(ok).To(BeFalse())
})
It("rejects keys with an empty nodeID segment", func() {
_, _, ok := galleryop.ParseNodeScopedKey("node::llama-cpp")
Expect(ok).To(BeFalse())
})
})
})

@@ -0,0 +1,13 @@
package galleryop
import "errors"
// ErrWorkerStillInstalling indicates a distributed backend install
// timed out at the NATS round-trip layer but the worker is most likely
// still pulling the OCI image in the background. Producers
// (DistributedBackendManager) wrap this when the round-trip times out;
// consumers (backendHandler) use errors.Is(err, ErrWorkerStillInstalling)
// to surface a yellow "in progress" OpStatus instead of a red error,
// leaving the pending_backend_ops row in place for the reconciler to
// confirm via backend.list.
var ErrWorkerStillInstalling = errors.New("worker did not reply in time; install may still be running in the background")

@@ -0,0 +1,149 @@
package galleryop_test
import (
"encoding/json"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/services/galleryop"
)
var _ = Describe("NodeStatus constants", func() {
// Pin the wire-format string values. A future refactor that renames
// a constant must NOT silently change the JSON value the UI receives
// (or the cross-package contract with the nodes package, which
// reuses these constants for NodeOpStatus.Status).
DescribeTable("status constant",
func(actual, expected string) {
Expect(actual).To(Equal(expected))
},
Entry("queued", galleryop.NodeStatusQueued, "queued"),
Entry("downloading", galleryop.NodeStatusDownloading, "downloading"),
Entry("running on worker", galleryop.NodeStatusRunningOnWorker, "running_on_worker"),
Entry("success", galleryop.NodeStatusSuccess, "success"),
Entry("error", galleryop.NodeStatusError, "error"),
)
})
var _ = Describe("OpStatus.Nodes", func() {
It("defaults to empty on a fresh OpStatus", func() {
os := &galleryop.OpStatus{}
Expect(os.Nodes).To(BeEmpty())
})
It("JSON round-trips with all NodeProgress fields", func() {
os := &galleryop.OpStatus{
Nodes: []galleryop.NodeProgress{
{
NodeID: "node-1",
NodeName: "worker-a",
Status: galleryop.NodeStatusRunningOnWorker,
FileName: "vllm.tar.zst",
Current: "412 MB",
Total: "2.1 GB",
Percentage: 19.6,
Phase: "downloading", // literal pins the wire-format value
Error: "",
},
},
}
raw, err := json.Marshal(os)
Expect(err).ToNot(HaveOccurred())
got := &galleryop.OpStatus{}
Expect(json.Unmarshal(raw, got)).To(Succeed())
Expect(got.Nodes).To(HaveLen(1))
Expect(got.Nodes[0]).To(Equal(os.Nodes[0]))
})
})
var _ = Describe("GalleryService.UpdateNodeProgress", func() {
var svc *galleryop.GalleryService
BeforeEach(func() {
// UpdateNodeProgress + GetStatus only touch the in-memory statuses
// map. A zero-value ApplicationConfig is enough to get past the
// LocalModelManager / LocalBackendManager constructors.
svc = galleryop.NewGalleryService(&config.ApplicationConfig{}, nil)
})
It("creates a node entry on first call", func() {
svc.UpdateNodeProgress("op1", "n1", galleryop.NodeProgress{
NodeID: "n1", NodeName: "worker-a", Status: galleryop.NodeStatusDownloading, Percentage: 12.0,
})
st := svc.GetStatus("op1")
Expect(st).ToNot(BeNil())
Expect(st.Nodes).To(HaveLen(1))
Expect(st.Nodes[0].NodeID).To(Equal("n1"))
Expect(st.Nodes[0].Percentage).To(Equal(12.0))
})
It("merges subsequent updates into the same NodeID entry, not appending", func() {
svc.UpdateNodeProgress("op1", "n1", galleryop.NodeProgress{NodeID: "n1", NodeName: "worker-a", Status: galleryop.NodeStatusDownloading, Percentage: 12.0})
svc.UpdateNodeProgress("op1", "n1", galleryop.NodeProgress{NodeID: "n1", NodeName: "worker-a", Status: galleryop.NodeStatusDownloading, Percentage: 48.0, FileName: "vllm.tar"})
st := svc.GetStatus("op1")
Expect(st.Nodes).To(HaveLen(1))
Expect(st.Nodes[0].Percentage).To(Equal(48.0))
Expect(st.Nodes[0].FileName).To(Equal("vllm.tar"))
})
It("appends a new entry for a different NodeID", func() {
svc.UpdateNodeProgress("op1", "n1", galleryop.NodeProgress{NodeID: "n1", NodeName: "worker-a", Status: galleryop.NodeStatusDownloading, Percentage: 12.0})
svc.UpdateNodeProgress("op1", "n2", galleryop.NodeProgress{NodeID: "n2", NodeName: "worker-b", Status: galleryop.NodeStatusQueued})
st := svc.GetStatus("op1")
Expect(st.Nodes).To(HaveLen(2))
})
It("mirrors the latest tick into the aggregate OpStatus fields", func() {
svc.UpdateNodeProgress("op1", "n1", galleryop.NodeProgress{
NodeID: "n1", NodeName: "worker-a", Status: galleryop.NodeStatusDownloading,
Percentage: 33.0, FileName: "vllm.tar", Current: "330 MB", Total: "1 GB",
})
st := svc.GetStatus("op1")
Expect(st.Progress).To(Equal(33.0))
Expect(st.FileName).To(Equal("vllm.tar"))
Expect(st.DownloadedFileSize).To(Equal("330 MB"))
Expect(st.TotalFileSize).To(Equal("1 GB"))
})
It("preserves accumulated Nodes when a subsequent UpdateStatus comes through the legacy path", func() {
// Regression: the Phase 2 progress bridge also calls the legacy
// progressCb -> UpdateStatus(opID, &OpStatus{...}) on every tick.
// Without preservation that overwrite would wipe the Nodes slice
// and the UI would flicker between one node and another on a
// multi-worker install. UpdateStatus must carry forward existing
// Nodes when the incoming op has none.
svc.UpdateNodeProgress("op1", "n1", galleryop.NodeProgress{NodeID: "n1", NodeName: "worker-a", Status: galleryop.NodeStatusSuccess})
svc.UpdateNodeProgress("op1", "n2", galleryop.NodeProgress{NodeID: "n2", NodeName: "worker-b", Status: galleryop.NodeStatusDownloading, Percentage: 30.0})
// Now simulate the legacy progressCb path: a fresh OpStatus
// pointer with no Nodes set, carrying only aggregate fields.
svc.UpdateStatus("op1", &galleryop.OpStatus{
Progress: 30.0,
Message: "downloading",
})
st := svc.GetStatus("op1")
Expect(st.Nodes).To(HaveLen(2), "Nodes accumulated before the legacy UpdateStatus must be preserved")
ids := []string{st.Nodes[0].NodeID, st.Nodes[1].NodeID}
Expect(ids).To(ConsistOf("n1", "n2"))
})
It("replaces the existing Nodes when the caller passes a non-empty Nodes slice", func() {
// If a caller explicitly passes a non-empty Nodes slice on the
// incoming op, that should replace the existing slice (no merge).
// Only an EMPTY incoming slice triggers the carry-forward.
svc.UpdateNodeProgress("op1", "n1", galleryop.NodeProgress{NodeID: "n1", NodeName: "worker-a", Status: galleryop.NodeStatusSuccess})
svc.UpdateStatus("op1", &galleryop.OpStatus{
Progress: 100.0,
Nodes: []galleryop.NodeProgress{
{NodeID: "n9", NodeName: "worker-final", Status: galleryop.NodeStatusSuccess},
},
})
st := svc.GetStatus("op1")
Expect(st.Nodes).To(HaveLen(1))
Expect(st.Nodes[0].NodeID).To(Equal("n9"))
})
})

@@ -2,6 +2,7 @@ package galleryop
import (
"context"
"strings"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/pkg/xsync"
@@ -30,6 +31,12 @@ type ManagementOp[T any, E any] struct {
ExternalName string // Custom name for the backend
ExternalAlias string // Custom alias for the backend
// TargetNodeID scopes a backend install/upgrade to a single worker node.
// Empty means fan out to every healthy backend node (the previous behavior).
// Set by InstallBackendOnNodeEndpoint so an admin can install a hardware-specific
// build on one node without touching the rest of the cluster.
TargetNodeID string
// Upgrade is true if this is an upgrade operation (not a fresh install)
Upgrade bool
}
@@ -46,6 +53,45 @@ type OpStatus struct {
GalleryElementName string `json:"gallery_element_name"`
Cancelled bool `json:"cancelled"` // Cancelled is true if the operation was cancelled
Cancellable bool `json:"cancellable"` // Cancellable is true if the operation can be cancelled
// Nodes is the per-node breakdown for a fanned-out backend install.
// Populated by DistributedBackendManager (per-node terminal status)
// and by the Phase 2 progress bridge (per-byte ticks). The
// /api/operations handler surfaces this so the UI can render an
// expandable per-node view of an in-flight install.
Nodes []NodeProgress `json:"nodes,omitempty"`
}
// NodeStatus values shared between NodeProgress (per-node tick) and the
// NodeOpStatus surfaced by DistributedBackendManager's fan-out. Defined
// as exported constants so producers (the manager, the progress bridge)
// and consumers (the /api/operations handler, the React OperationsBar
// through its JSON contract) stay in sync via a single source of truth.
const (
NodeStatusQueued = "queued" // node accepted the intent but install has not started
NodeStatusDownloading = "downloading" // worker is actively pulling the OCI image
NodeStatusRunningOnWorker = "running_on_worker" // NATS round-trip timed out but worker is still installing
NodeStatusSuccess = "success" // install completed on this node
NodeStatusError = "error" // install failed on this node
)
// NodeProgress is a single node's contribution to a backend install
// operation. Populated by DistributedBackendManager (per-node terminal
// status) and by the Phase 2 progress bridge (per-byte ticks). Read by
// the /api/operations handler so the UI can render an expandable
// per-node breakdown.
//
// Status holds one of the NodeStatus* constants above.
type NodeProgress struct {
NodeID string `json:"node_id"`
NodeName string `json:"node_name"`
Status string `json:"status"`
FileName string `json:"file_name,omitempty"`
Current string `json:"current,omitempty"`
Total string `json:"total,omitempty"`
Percentage float64 `json:"percentage"`
Phase string `json:"phase,omitempty"`
Error string `json:"error,omitempty"`
}
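
Reviewer note: the struct tags above determine what the UI receives. The sketch below copies the `NodeProgress` struct from the diff and marshals a sparse value to show which keys survive `omitempty`. The `encode` helper is hypothetical and exists only for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// NodeProgress copies the fields and JSON tags from the diff.
type NodeProgress struct {
	NodeID     string  `json:"node_id"`
	NodeName   string  `json:"node_name"`
	Status     string  `json:"status"`
	FileName   string  `json:"file_name,omitempty"`
	Current    string  `json:"current,omitempty"`
	Total      string  `json:"total,omitempty"`
	Percentage float64 `json:"percentage"`
	Phase      string  `json:"phase,omitempty"`
	Error      string  `json:"error,omitempty"`
}

// encode is a hypothetical helper returning the JSON form as a string.
func encode(np NodeProgress) string {
	raw, err := json.Marshal(np)
	if err != nil {
		return ""
	}
	return string(raw)
}

func main() {
	// A queued node has no file, size, phase, or error yet.
	fmt.Println(encode(NodeProgress{NodeID: "node-1", NodeName: "worker-a", Status: "queued"}))
}
```

Note that `percentage` has no `omitempty`, so a queued node still serializes `"percentage":0`, while the optional string fields disappear from the payload.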
type OpCache struct {
@@ -115,3 +161,31 @@ func (m *OpCache) GetStatus() (map[string]string, map[string]string) {
return processingModelsData, taskTypes
}
// NodeScopedKeyPrefix is the opcache key prefix used by InstallBackendOnNodeEndpoint
// so per-node installs do not collide on the bare backend name. Format:
// "node:<nodeID>:<backend>". Read by /api/operations to extract nodeID for the UI.
const NodeScopedKeyPrefix = "node:"
// NodeScopedKey returns the opcache key for a node-scoped backend operation.
// The prefix lets ParseNodeScopedKey recover the nodeID from the key, so the
// operations endpoint can surface it without storing nodeID separately.
func NodeScopedKey(nodeID, backend string) string {
return NodeScopedKeyPrefix + nodeID + ":" + backend
}
// ParseNodeScopedKey extracts (nodeID, backend) from a key built by NodeScopedKey.
// Returns ok=false for keys that lack the prefix or are missing the nodeID or
// backend segment. Backend names containing colons are preserved because we
// split on the first colon after the prefix only.
func ParseNodeScopedKey(key string) (nodeID, backend string, ok bool) {
rest, hasPrefix := strings.CutPrefix(key, NodeScopedKeyPrefix)
if !hasPrefix {
return "", "", false
}
nodeID, backend, ok = strings.Cut(rest, ":")
if !ok || nodeID == "" || backend == "" {
return "", "", false
}
return nodeID, backend, true
}
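
Reviewer note: a round-trip sketch of the key helpers above. The two functions are copied from the diff (they need Go 1.20+ for `strings.CutPrefix`); the `main` body is illustrative only.

```go
package main

import (
	"fmt"
	"strings"
)

const NodeScopedKeyPrefix = "node:"

// NodeScopedKey builds "node:<nodeID>:<backend>", as in the diff.
func NodeScopedKey(nodeID, backend string) string {
	return NodeScopedKeyPrefix + nodeID + ":" + backend
}

// ParseNodeScopedKey splits on the first colon after the prefix, so
// colons inside the backend name are preserved.
func ParseNodeScopedKey(key string) (nodeID, backend string, ok bool) {
	rest, hasPrefix := strings.CutPrefix(key, NodeScopedKeyPrefix)
	if !hasPrefix {
		return "", "", false
	}
	nodeID, backend, ok = strings.Cut(rest, ":")
	if !ok || nodeID == "" || backend == "" {
		return "", "", false
	}
	return nodeID, backend, true
}

func main() {
	key := NodeScopedKey("node-1", "official@llama-cpp:v2")
	fmt.Println(key)

	node, backend, ok := ParseNodeScopedKey(key)
	fmt.Println(node, backend, ok)

	// A bare backend name is not node-scoped.
	_, _, ok = ParseNodeScopedKey("llama-cpp")
	fmt.Println(ok)
}
```

Two properties follow from splitting on the first colon: a nodeID that itself contains a colon would not round-trip, and a bare backend name that happens to start with `node:` and contains a second colon would be parsed as node-scoped. The tests state that nodeIDs are UUIDs, which covers the first case.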

@@ -110,6 +110,18 @@ func (g *GalleryService) DeleteBackend(name string) error {
func (g *GalleryService) UpdateStatus(s string, op *OpStatus) {
g.Lock()
defer g.Unlock()
// Preserve any per-node entries already accumulated by UpdateNodeProgress:
// the legacy progressCb path (used by the Phase 2 install bridge) calls
// UpdateStatus with a fresh *OpStatus on every tick, which would otherwise
// wipe the Nodes slice and leave the UI flickering between one node and
// another. If the caller explicitly populates Nodes on the incoming op,
// that wins; an empty Nodes slice on the incoming op is treated as "no
// new per-node data" and the previous Nodes are carried forward.
if op != nil && len(op.Nodes) == 0 {
if prev := g.statuses[s]; prev != nil && len(prev.Nodes) > 0 {
op.Nodes = prev.Nodes
}
}
g.statuses[s] = op
// Persist to PostgreSQL in distributed mode
@@ -135,6 +147,47 @@ func (g *GalleryService) UpdateStatus(s string, op *OpStatus) {
}
}
// UpdateNodeProgress merges a per-node progress tick into OpStatus.Nodes,
// keyed by nodeID, and mirrors the latest values into the aggregate
// Progress / FileName / DownloadedFileSize / TotalFileSize / Message
// fields so the legacy single-bar OperationsBar view keeps working
// unchanged alongside the new per-node breakdown.
//
// We deliberately do NOT delegate the aggregate mirror to UpdateStatus
// here: UpdateStatus overwrites the entire OpStatus, which would clobber
// the Nodes slice we just merged into. Doing the merge + mirror under a
// single lock keeps both views consistent and concurrent-safe.
func (g *GalleryService) UpdateNodeProgress(opID, nodeID string, np NodeProgress) {
g.Lock()
defer g.Unlock()
status := g.statuses[opID]
if status == nil {
status = &OpStatus{}
g.statuses[opID] = status
}
merged := false
for i := range status.Nodes {
if status.Nodes[i].NodeID == nodeID {
status.Nodes[i] = np
merged = true
break
}
}
if !merged {
status.Nodes = append(status.Nodes, np)
}
// Mirror the latest tick into the legacy aggregate fields so the
// existing single-bar UI keeps rendering meaningful progress.
status.FileName = np.FileName
status.Progress = np.Percentage
status.DownloadedFileSize = np.Current
status.TotalFileSize = np.Total
if np.Phase != "" {
status.Message = np.Phase
}
}
func (g *GalleryService) GetStatus(s string) *OpStatus {
g.Lock()
defer g.Unlock()
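
Reviewer note: the interaction between `UpdateNodeProgress` (upsert by NodeID) and `UpdateStatus` (carry forward when the incoming Nodes slice is empty) can be sketched in isolation. The types below are simplified, lock-free stand-ins with hypothetical names; the real code holds the service mutex and also persists to PostgreSQL in distributed mode.

```go
package main

import "fmt"

// node and status are reduced stand-ins for NodeProgress and OpStatus.
type node struct {
	ID  string
	Pct float64
}

type status struct {
	Progress float64
	Nodes    []node
}

// store stands in for the in-memory statuses map (no locking here).
type store map[string]*status

// updateNode replaces the entry with a matching ID, or appends a new
// one, then mirrors the latest tick into the aggregate Progress field.
func (s store) updateNode(opID string, n node) {
	st := s[opID]
	if st == nil {
		st = &status{}
		s[opID] = st
	}
	st.Progress = n.Pct
	for i := range st.Nodes {
		if st.Nodes[i].ID == n.ID {
			st.Nodes[i] = n
			return
		}
	}
	st.Nodes = append(st.Nodes, n)
}

// updateStatus overwrites the whole status, but carries the previous
// Nodes forward when the incoming status has none.
func (s store) updateStatus(opID string, in *status) {
	if in != nil && len(in.Nodes) == 0 {
		if prev := s[opID]; prev != nil && len(prev.Nodes) > 0 {
			in.Nodes = prev.Nodes
		}
	}
	s[opID] = in
}

func main() {
	s := store{}
	s.updateNode("op1", node{ID: "n1", Pct: 12})
	s.updateNode("op1", node{ID: "n1", Pct: 48}) // merged, not appended
	s.updateNode("op1", node{ID: "n2", Pct: 0})
	s.updateStatus("op1", &status{Progress: 30}) // legacy path, no Nodes
	fmt.Println(len(s["op1"].Nodes), s["op1"].Progress)
}
```

One consequence of this design is that a caller cannot clear the per-node list by passing an empty slice; it can only replace it with a non-empty one.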

@@ -0,0 +1,36 @@
package messaging
// Phase values published on the BackendInstallProgressEvent.Phase field.
// Defined as exported constants so producer (worker install handler) and
// consumer (master bridge into OpStatus) share a single source of truth
// instead of two copies of the literal string.
const (
PhaseResolving = "resolving" // worker is locating the gallery / image manifest
PhaseDownloading = "downloading" // worker is actively pulling layers
PhaseExtracting = "extracting" // worker is unpacking the downloaded archive
PhaseStarting = "starting" // worker is spawning the gRPC backend process
)
// BackendInstallProgressEvent is the wire payload published by a worker to
// nodes.<nodeID>.backend.install.<opID>.progress while a long-running install
// is in flight. Transient: dropped events are acceptable, the master relies
// on BackendInstallReply for ground truth on success/failure.
//
// Phase holds one of the Phase* constants above.
type BackendInstallProgressEvent struct {
OpID string `json:"op_id"`
NodeID string `json:"node_id"`
Backend string `json:"backend"`
FileName string `json:"file_name,omitempty"`
Current string `json:"current,omitempty"` // human-readable size, e.g. "412 MB"
Total string `json:"total,omitempty"` // human-readable size, e.g. "2.1 GB"
Percentage float64 `json:"percentage"`
Phase string `json:"phase,omitempty"`
}
// SubjectNodeBackendInstallProgress returns the NATS subject for transient
// progress events emitted by a worker during a single backend.install run.
// Per-op so multiple concurrent installs on the same node never alias.
func SubjectNodeBackendInstallProgress(nodeID, opID string) string {
return subjectNodePrefix + sanitizeSubjectToken(nodeID) + ".backend.install." + sanitizeSubjectToken(opID) + ".progress"
}

@@ -0,0 +1,66 @@
package messaging_test
import (
"encoding/json"
"strings"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"github.com/mudler/LocalAI/core/services/messaging"
)
var _ = Describe("Phase constants", func() {
// Pin the wire-format string values. A future refactor that renames
// a constant must NOT silently change the JSON value the master
// receives or break consumers that switch on Phase.
DescribeTable("phase constant",
func(actual, expected string) {
Expect(actual).To(Equal(expected))
},
Entry("resolving", messaging.PhaseResolving, "resolving"),
Entry("downloading", messaging.PhaseDownloading, "downloading"),
Entry("extracting", messaging.PhaseExtracting, "extracting"),
Entry("starting", messaging.PhaseStarting, "starting"),
)
})
var _ = Describe("BackendInstallProgress", func() {
Context("SubjectNodeBackendInstallProgress", func() {
It("composes the per-op progress subject", func() {
Expect(messaging.SubjectNodeBackendInstallProgress("node-abc", "op-123")).
To(Equal("nodes.node-abc.backend.install.op-123.progress"))
})
It("sanitizes NATS-reserved characters in node and op tokens", func() {
// '.' is the NATS hierarchy delimiter, '*' and '>' are wildcards,
// and whitespace is not allowed in a subject; sanitizeSubjectToken
// replaces all of them with '-'. The resulting subject must still parse as
// exactly six hierarchy segments: nodes/<node>/backend/install/<op>/progress.
subj := messaging.SubjectNodeBackendInstallProgress("a.b c", "x.y z")
Expect(subj).ToNot(ContainSubstring(" "))
Expect(strings.Count(subj, ".")).To(Equal(5))
})
})
Context("BackendInstallProgressEvent", func() {
It("JSON round-trips with all known fields", func() {
ev := messaging.BackendInstallProgressEvent{
OpID: "op-123",
NodeID: "node-abc",
Backend: "vllm",
FileName: "vllm-cpu.tar.zst",
Current: "412 MB",
Total: "2.1 GB",
Percentage: 19.6,
Phase: "downloading",
}
raw, err := json.Marshal(ev)
Expect(err).ToNot(HaveOccurred())
var got messaging.BackendInstallProgressEvent
Expect(json.Unmarshal(raw, &got)).To(Succeed())
Expect(got).To(Equal(ev))
})
})
})

@@ -144,6 +144,12 @@ type BackendInstallRequest struct {
// worker still works (the master's install fallback path also uses this
// when backend.upgrade returns nats.ErrNoResponders).
Force bool `json:"force,omitempty"`
// OpID identifies the admin-side operation. When non-empty the worker
// publishes BackendInstallProgressEvent values to
// SubjectNodeBackendInstallProgress(nodeID, OpID) while the install is
// running, debounced to roughly 250ms. Empty means the caller is a
// reconciler-driven retry that does not need progress streamed.
OpID string `json:"op_id,omitempty"`
}
// BackendInstallReply is the response from a backend.install NATS request.

@@ -0,0 +1,120 @@
package nodes
import (
"sync"
"time"
"github.com/mudler/LocalAI/core/services/messaging"
)
// DebouncedInstallProgressPublisher buffers backend-install download ticks
// and publishes them to the per-op NATS progress subject at most once per
// `interval`. Always publishes the final event on Flush so the UI sees the
// terminal percentage.
//
// Behavior: leading-edge debounce. The first OnDownload after a quiet window
// publishes immediately; subsequent ticks within `interval` only buffer the
// latest event, which is then emitted via a single trailing timer. This
// keeps the wire chatter bounded (~4 events per second at 250ms) while
// still surfacing every meaningful percentage jump.
//
// Lock ordering: never hold p.mu across a Publish call. Publish hits the
// NATS client which may block on a slow link, and we don't want a stalled
// network to stall the underlying gallery download loop.
type DebouncedInstallProgressPublisher struct {
mu sync.Mutex
client messaging.MessagingClient
subject string
nodeID string
opID string
backend string
interval time.Duration
lastPublishedAt time.Time
pending *messaging.BackendInstallProgressEvent
timer *time.Timer
}
// NewDebouncedInstallProgressPublisher constructs a publisher for one
// install operation. interval is the leading-edge debounce window
// (~250ms in production).
func NewDebouncedInstallProgressPublisher(client messaging.MessagingClient, nodeID, opID, backend string, interval time.Duration) *DebouncedInstallProgressPublisher {
return &DebouncedInstallProgressPublisher{
client: client,
subject: messaging.SubjectNodeBackendInstallProgress(nodeID, opID),
nodeID: nodeID,
opID: opID,
backend: backend,
interval: interval,
}
}
// OnDownload is the callback shape gallery.InstallBackendFromGallery and
// galleryop.InstallExternalBackend pass into the worker. Each invocation
// represents a single tick from the underlying io.Reader copy loop.
func (p *DebouncedInstallProgressPublisher) OnDownload(file, current, total string, percentage float64) {
ev := messaging.BackendInstallProgressEvent{
OpID: p.opID,
NodeID: p.nodeID,
Backend: p.backend,
FileName: file,
Current: current,
Total: total,
Percentage: percentage,
Phase: messaging.PhaseDownloading,
}
p.mu.Lock()
now := time.Now()
if p.lastPublishedAt.IsZero() || now.Sub(p.lastPublishedAt) >= p.interval {
// Leading edge: publish immediately.
p.lastPublishedAt = now
p.pending = nil
p.mu.Unlock()
_ = p.client.Publish(p.subject, ev)
return
}
// Within the window: buffer the latest event and arm a trailing
// publish. If a timer is already armed, we just overwrite p.pending so
// the trailing publish carries the freshest data.
p.pending = &ev
if p.timer == nil {
delay := p.interval - now.Sub(p.lastPublishedAt)
p.timer = time.AfterFunc(delay, p.flushPending)
}
p.mu.Unlock()
}
// flushPending is the trailing-edge publisher fired by the AfterFunc timer.
// It clears the pending slot under the lock, then publishes outside the
// lock so Publish never blocks an in-progress OnDownload call.
func (p *DebouncedInstallProgressPublisher) flushPending() {
p.mu.Lock()
p.timer = nil
pending := p.pending
p.pending = nil
if pending != nil {
p.lastPublishedAt = time.Now()
}
p.mu.Unlock()
if pending != nil {
_ = p.client.Publish(p.subject, *pending)
}
}
// Flush publishes any pending buffered event synchronously and stops the
// pending timer. Safe to call multiple times. Callers MUST defer Flush
// after constructing the publisher so the terminal percentage reaches the
// master even on error returns.
func (p *DebouncedInstallProgressPublisher) Flush() {
p.mu.Lock()
if p.timer != nil {
p.timer.Stop()
p.timer = nil
}
pending := p.pending
p.pending = nil
p.mu.Unlock()
if pending != nil {
_ = p.client.Publish(p.subject, *pending)
}
}
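
Reviewer note: the leading-edge-plus-trailing-timer pattern above can be exercised without NATS. The sketch below reduces the event to a bare percentage and swaps the messaging client for a callback; `debouncer` and `run` are hypothetical names. A one-hour window guarantees the trailing timer never fires during the run, so the result depends only on the leading edge and `flush`.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// debouncer is a reduced version of the publisher in the diff.
type debouncer struct {
	mu       sync.Mutex
	interval time.Duration
	last     time.Time
	pending  *float64
	timer    *time.Timer
	publish  func(float64)
}

// tick publishes immediately after a quiet window; otherwise it
// buffers the latest value and arms one trailing timer.
func (d *debouncer) tick(pct float64) {
	d.mu.Lock()
	now := time.Now()
	if d.last.IsZero() || now.Sub(d.last) >= d.interval {
		d.last = now
		d.pending = nil
		d.mu.Unlock()
		d.publish(pct) // never publish while holding the lock
		return
	}
	d.pending = &pct
	if d.timer == nil {
		d.timer = time.AfterFunc(d.interval-now.Sub(d.last), d.fire)
	}
	d.mu.Unlock()
}

// fire is the trailing-edge publish run by the timer.
func (d *debouncer) fire() {
	d.mu.Lock()
	d.timer = nil
	p := d.pending
	d.pending = nil
	if p != nil {
		d.last = time.Now()
	}
	d.mu.Unlock()
	if p != nil {
		d.publish(*p)
	}
}

// flush stops the timer and publishes whatever is still buffered.
func (d *debouncer) flush() {
	d.mu.Lock()
	if d.timer != nil {
		d.timer.Stop()
		d.timer = nil
	}
	p := d.pending
	d.pending = nil
	d.mu.Unlock()
	if p != nil {
		d.publish(*p)
	}
}

// run feeds ticks through a one-hour window and returns what was published.
func run(ticks ...float64) []float64 {
	var got []float64
	d := &debouncer{interval: time.Hour, publish: func(p float64) { got = append(got, p) }}
	for _, t := range ticks {
		d.tick(t)
	}
	d.flush()
	return got
}

func main() {
	fmt.Println(run(10, 20, 30))
}
```

With three back-to-back ticks, the first is published on the leading edge, the second is overwritten by the third while buffered, and `flush` emits the third. The intermediate value is dropped by design.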

@@ -0,0 +1,48 @@
package nodes
import (
"time"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"github.com/mudler/LocalAI/core/services/messaging"
)
var _ = Describe("DebouncedInstallProgressPublisher", func() {
It("publishes the first event immediately and debounces subsequent ones within the window", func() {
mc := newScriptedMessagingClient()
pub := NewDebouncedInstallProgressPublisher(mc, "n1", "op1", "vllm", 50*time.Millisecond)
// Three rapid-fire ticks within the debounce window.
pub.OnDownload("vllm.tar.zst", "100 MB", "1 GB", 10.0)
pub.OnDownload("vllm.tar.zst", "200 MB", "1 GB", 20.0)
pub.OnDownload("vllm.tar.zst", "300 MB", "1 GB", 30.0)
pub.Flush()
// The first event publishes immediately, the later ticks coalesce into one
// pending event, and Flush emits it. Expect at least 2 publishes; 4 is a
// loose upper bound that tolerates a trailing timer firing before Flush.
Eventually(func() int {
return len(mc.publishCalls(messaging.SubjectNodeBackendInstallProgress("n1", "op1")))
}, "1s").Should(BeNumerically(">=", 2))
calls := mc.publishCalls(messaging.SubjectNodeBackendInstallProgress("n1", "op1"))
Expect(len(calls)).To(BeNumerically("<=", 4),
"three ticks within the debounce window should produce at most ~4 publishes")
})
It("publishes the final event after Flush with the latest percentage", func() {
mc := newScriptedMessagingClient()
pub := NewDebouncedInstallProgressPublisher(mc, "n1", "op1", "vllm", 50*time.Millisecond)
pub.OnDownload("vllm.tar.zst", "1 GB", "1 GB", 100.0)
pub.Flush()
Eventually(func() float64 {
calls := mc.publishCalls(messaging.SubjectNodeBackendInstallProgress("n1", "op1"))
if len(calls) == 0 {
return -1
}
return calls[len(calls)-1].Percentage
}, "1s").Should(Equal(100.0))
})
})

@@ -10,6 +10,7 @@ import (
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/gallery"
"github.com/mudler/LocalAI/core/services/galleryop"
"github.com/mudler/LocalAI/core/services/messaging"
"github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/LocalAI/pkg/system"
"github.com/mudler/xlog"
@@ -48,6 +49,13 @@ func (d *DistributedModelManager) InstallModel(ctx context.Context, op *galleryo
return d.local.InstallModel(ctx, op, progressCb)
}
// nodeProgressSink is the narrow interface DistributedBackendManager uses to
// publish per-node progress without dragging in the full *GalleryService.
// nil means "no sink, skip per-node writes" (used by single-node tests).
type nodeProgressSink interface {
UpdateNodeProgress(opID, nodeID string, np galleryop.NodeProgress)
}
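
Reviewer note: a sketch of the nil-able narrow-interface pattern used for `nodeProgressSink`. The signature is simplified to plain strings and all names (`progressSink`, `recorder`, `manager`, `emit`) are hypothetical; the guard mirrors the `d.progressSink == nil || opID == ""` check in `emitNodeProgress`.

```go
package main

import "fmt"

// progressSink is a one-method interface, so the manager does not need
// to depend on the full service type.
type progressSink interface {
	UpdateNodeProgress(opID, nodeID, status string)
}

// recorder is a test double that remembers every call.
type recorder struct{ events []string }

func (r *recorder) UpdateNodeProgress(opID, nodeID, status string) {
	r.events = append(r.events, opID+"/"+nodeID+"="+status)
}

type manager struct{ sink progressSink }

// emit forwards to the sink only when one is configured and the op is
// tracked; it reports whether anything was written.
func (m *manager) emit(opID, nodeID, status string) bool {
	if m.sink == nil || opID == "" {
		return false
	}
	m.sink.UpdateNodeProgress(opID, nodeID, status)
	return true
}

func main() {
	noSink := &manager{}
	fmt.Println(noSink.emit("op1", "n1", "success"))

	rec := &recorder{}
	withSink := &manager{sink: rec}
	fmt.Println(withSink.emit("", "n1", "success"))
	fmt.Println(withSink.emit("op1", "n1", "success"), rec.events)
}
```

One Go caveat applies to callers: passing a nil `*GalleryService` typed pointer through the interface parameter yields a non-nil interface value, so the nil guard would not trip. Callers that want no sink should pass a literal `nil`.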
// DistributedBackendManager wraps a local BackendManager and adds NATS fan-out
// for backend deletion so worker nodes clean up stale files.
type DistributedBackendManager struct {
@@ -56,26 +64,31 @@ type DistributedBackendManager struct {
registry *NodeRegistry
backendGalleries []config.Gallery
systemState *system.SystemState
progressSink nodeProgressSink
}
// NewDistributedBackendManager creates a DistributedBackendManager.
func NewDistributedBackendManager(appConfig *config.ApplicationConfig, ml *model.ModelLoader, adapter *RemoteUnloaderAdapter, registry *NodeRegistry) *DistributedBackendManager {
// progressSink may be nil to disable per-node OpStatus writes (single-node
// tests don't need it).
func NewDistributedBackendManager(appConfig *config.ApplicationConfig, ml *model.ModelLoader, adapter *RemoteUnloaderAdapter, registry *NodeRegistry, progressSink nodeProgressSink) *DistributedBackendManager {
return &DistributedBackendManager{
local: galleryop.NewLocalBackendManager(appConfig, ml),
adapter: adapter,
registry: registry,
backendGalleries: appConfig.BackendGalleries,
systemState: appConfig.SystemState,
progressSink: progressSink,
}
}
// NodeOpStatus is the per-node outcome of a backend lifecycle operation.
// Returned as part of BackendOpResult so the frontend can surface exactly
// what happened on each worker instead of a single joined error string.
// Status holds one of the galleryop.NodeStatus* constants.
type NodeOpStatus struct {
NodeID string `json:"node_id"`
NodeName string `json:"node_name"`
Status string `json:"status"` // "success" | "queued" | "error"
Status string `json:"status"`
Error string `json:"error,omitempty"`
}
@@ -93,7 +106,7 @@ type BackendOpResult struct {
func (r BackendOpResult) Err() error {
var failures []string
for _, n := range r.Nodes {
if n.Status == "error" {
if n.Status == galleryop.NodeStatusError {
failures = append(failures, fmt.Sprintf("%s: %s", n.NodeName, n.Error))
}
}
@@ -116,25 +129,48 @@ func (r BackendOpResult) Err() error {
// when the node returns.
// targetNodeIDs is an optional allowlist: when non-nil, only nodes whose ID is
// in the set are visited. Used by UpgradeBackend to avoid asking nodes that
// never had the backend installed to "upgrade" it - such requests fail at the
// gallery (no platform variant) and would otherwise leave a forever-retrying
// pending_backend_ops row. nil means "fan out to every node" (Install/Delete).
func (d *DistributedBackendManager) enqueueAndDrainBackendOp(ctx context.Context, op, backend string, galleriesJSON []byte, targetNodeIDs map[string]bool, apply func(node BackendNode) error) (BackendOpResult, error) {
//
// opID is the gallery operation identifier; when non-empty and progressSink is
// set, every per-node terminal status appended to BackendOpResult is also
// mirrored into the sink so the UI's per-node OpStatus.Nodes view stays in
// lockstep with the manager's view. opID may be empty for ops that aren't
// gallery-tracked (e.g. DeleteBackend's plain code path).
func (d *DistributedBackendManager) enqueueAndDrainBackendOp(ctx context.Context, opID, op, backend string, galleriesJSON []byte, targetNodeIDs map[string]bool, apply func(node BackendNode) error) (BackendOpResult, error) {
allNodes, err := d.registry.List(ctx)
if err != nil {
return BackendOpResult{}, err
}
// emitNodeProgress is a small helper that funnels every NodeOpStatus we
// append to result.Nodes into the per-node OpStatus sink (when configured
// and opID is known). Keeping it inline avoids drift between the
// BackendOpResult view and the sink view - they're written from the same
// code path on the same terminal statuses.
emitNodeProgress := func(node BackendNode, status, errMsg string) {
if d.progressSink == nil || opID == "" {
return
}
d.progressSink.UpdateNodeProgress(opID, node.ID, galleryop.NodeProgress{
NodeID: node.ID,
NodeName: node.Name,
Status: status,
Error: errMsg,
})
}
result := BackendOpResult{Nodes: make([]NodeOpStatus, 0, len(allNodes))}
for _, node := range allNodes {
// Pending nodes haven't been approved yet - no intent to apply.
if node.Status == StatusPending {
continue
}
// Backend lifecycle ops only make sense on backend-type workers.
// Agent workers don't subscribe to backend.install/delete/list, so
// enqueueing for them guarantees a forever-retrying row that the
// reconciler can never drain. Silently skip - they aren't consumers.
if node.NodeType != "" && node.NodeType != NodeTypeBackend {
continue
}
@@ -143,19 +179,23 @@ func (d *DistributedBackendManager) enqueueAndDrainBackendOp(ctx context.Context
}
if err := d.registry.UpsertPendingBackendOp(ctx, node.ID, backend, op, galleriesJSON); err != nil {
xlog.Warn("Failed to enqueue backend op", "op", op, "node", node.Name, "backend", backend, "error", err)
errMsg := fmt.Sprintf("enqueue failed: %v", err)
result.Nodes = append(result.Nodes, NodeOpStatus{
NodeID: node.ID, NodeName: node.Name, Status: "error",
Error: fmt.Sprintf("enqueue failed: %v", err),
NodeID: node.ID, NodeName: node.Name, Status: galleryop.NodeStatusError,
Error: errMsg,
})
emitNodeProgress(node, galleryop.NodeStatusError, errMsg)
continue
}
if node.Status != StatusHealthy {
// Intent is recorded; reconciler will retry when the node recovers.
errMsg := fmt.Sprintf("node %s, will retry when healthy", node.Status)
result.Nodes = append(result.Nodes, NodeOpStatus{
NodeID: node.ID, NodeName: node.Name, Status: "queued",
Error: fmt.Sprintf("node %s, will retry when healthy", node.Status),
NodeID: node.ID, NodeName: node.Name, Status: galleryop.NodeStatusQueued,
Error: errMsg,
})
emitNodeProgress(node, galleryop.NodeStatusQueued, errMsg)
continue
}
@@ -167,14 +207,33 @@ func (d *DistributedBackendManager) enqueueAndDrainBackendOp(ctx context.Context
xlog.Debug("Failed to clear pending backend op after success", "error", err)
}
result.Nodes = append(result.Nodes, NodeOpStatus{
NodeID: node.ID, NodeName: node.Name, Status: "success",
NodeID: node.ID, NodeName: node.Name, Status: galleryop.NodeStatusSuccess,
})
emitNodeProgress(node, galleryop.NodeStatusSuccess, "")
continue
}
// Record failure for backoff. If it's an ErrNoResponders, the node's
// gone AWOL - mark unhealthy so the router stops picking it too.
errMsg := applyErr.Error()
// Worker-still-installing is a "soft" failure: the worker is most
// likely still pulling the OCI image. Keep the row, push NextRetryAt
// out so the reconciler does not immediately re-fire another install
// while the worker is still busy, and report the in-progress state
// to the caller. The next reconciler pass / backend.list confirms
// the actual outcome.
if errors.Is(applyErr, galleryop.ErrWorkerStillInstalling) {
if id, err := d.findPendingRow(ctx, node.ID, backend, op); err == nil {
_ = d.registry.RecordPendingBackendOpInFlight(ctx, id, errMsg, d.adapter.InstallTimeout())
}
result.Nodes = append(result.Nodes, NodeOpStatus{
NodeID: node.ID, NodeName: node.Name, Status: galleryop.NodeStatusRunningOnWorker, Error: errMsg,
})
emitNodeProgress(node, galleryop.NodeStatusRunningOnWorker, errMsg)
continue
}
if errors.Is(applyErr, nats.ErrNoResponders) {
xlog.Warn("No NATS responders for node, marking unhealthy", "node", node.Name, "nodeID", node.ID)
d.registry.MarkUnhealthy(ctx, node.ID)
@@ -183,8 +242,9 @@ func (d *DistributedBackendManager) enqueueAndDrainBackendOp(ctx context.Context
_ = d.registry.RecordPendingBackendOpFailure(ctx, id, errMsg)
}
result.Nodes = append(result.Nodes, NodeOpStatus{
NodeID: node.ID, NodeName: node.Name, Status: "error", Error: errMsg,
NodeID: node.ID, NodeName: node.Name, Status: galleryop.NodeStatusError, Error: errMsg,
})
emitNodeProgress(node, galleryop.NodeStatusError, errMsg)
}
return result, nil
}
@@ -226,7 +286,11 @@ func (d *DistributedBackendManager) DeleteBackend(name string) error {
}
ctx := context.Background()
result, err := d.enqueueAndDrainBackendOp(ctx, OpBackendDelete, name, nil, nil, func(node BackendNode) error {
// Empty opID: plain DeleteBackend isn't gallery-tracked the same way as
// Install/Upgrade (no progress dialog), so we skip the per-node sink
// writes here. DeleteBackendDetailed is the HTTP path that surfaces
// per-node results in its own response.
result, err := d.enqueueAndDrainBackendOp(ctx, "", OpBackendDelete, name, nil, nil, func(node BackendNode) error {
reply, err := d.adapter.DeleteBackend(node.ID, name)
if err != nil {
return err
@@ -249,7 +313,7 @@ func (d *DistributedBackendManager) DeleteBackendDetailed(ctx context.Context, n
if err := d.local.DeleteBackend(name); err != nil && !errors.Is(err, gallery.ErrBackendNotFound) {
return BackendOpResult{}, err
}
return d.enqueueAndDrainBackendOp(ctx, OpBackendDelete, name, nil, nil, func(node BackendNode) error {
return d.enqueueAndDrainBackendOp(ctx, "", OpBackendDelete, name, nil, nil, func(node BackendNode) error {
reply, err := d.adapter.DeleteBackend(node.ID, name)
if err != nil {
return err
@@ -324,22 +388,113 @@ func (d *DistributedBackendManager) ListBackends() (gallery.SystemBackends, erro
result[b.Name] = entry
}
}
// Proactively clear pending_backend_ops install rows whose intent is now
// satisfied: the backend is reported installed on its target node. Without
// this, the row sits in the queue until next_retry_at expires (up to the
// install timeout, default 15m) and the operator UI shows the install as
// "still installing in background" for that whole window even though the
// worker has actually been ready for minutes. We only clear install rows;
// upgrade and delete rows have presence-based semantics that do NOT match
// backend.list confirmation.
d.clearSatisfiedInstallRows(context.Background(), result)
return result, nil
}
// clearSatisfiedInstallRows removes pending_backend_ops install rows whose
// (nodeID, backend) pair now appears in the cluster-wide backend listing.
// Called by ListBackends after fan-out so the proactive clear sees every
// node's report. Best-effort: a DB failure is logged and the row stays for
// the reconciler to drain via its slower path.
func (d *DistributedBackendManager) clearSatisfiedInstallRows(ctx context.Context, backends gallery.SystemBackends) {
rows, err := d.registry.ListPendingBackendOps(ctx)
if err != nil {
xlog.Debug("clearSatisfiedInstallRows: failed to list pending ops", "error", err)
return
}
if len(rows) == 0 {
return
}
// Build a (nodeID, backend) presence set from the listing.
present := make(map[string]map[string]bool, len(backends))
for name, b := range backends {
for _, ref := range b.Nodes {
if present[ref.NodeID] == nil {
present[ref.NodeID] = make(map[string]bool)
}
present[ref.NodeID][name] = true
}
}
for _, row := range rows {
if row.Op != OpBackendInstall {
continue
}
if !present[row.NodeID][row.Backend] {
continue
}
if err := d.registry.DeletePendingBackendOp(ctx, row.ID); err != nil {
xlog.Debug("clearSatisfiedInstallRows: delete failed",
"id", row.ID, "node", row.NodeID, "backend", row.Backend, "error", err)
continue
}
xlog.Info("Reconciler: pending install row satisfied by backend.list",
"node", row.NodeID, "backend", row.Backend)
}
}
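The presence-set lookup in clearSatisfiedInstallRows can be sketched in isolation. This is a minimal illustration, not the project's code: pendingOp, satisfiedInstalls, and the "install" op string are hypothetical stand-ins for the registry row type and OpBackendInstall, and the listing is simplified to a map from backend name to node IDs.

```go
package main

// pendingOp is a hypothetical stand-in for a pending_backend_ops row.
type pendingOp struct {
	ID      uint
	NodeID  string
	Backend string
	Op      string
}

// satisfiedInstalls returns the IDs of install rows whose (nodeID, backend)
// pair appears in listing. listing maps a backend name to the node IDs that
// report it as installed. The "install" literal is an assumed op value.
func satisfiedInstalls(rows []pendingOp, listing map[string][]string) []uint {
	// Build a nodeID -> backend -> present index from the listing.
	present := make(map[string]map[string]bool)
	for name, nodeIDs := range listing {
		for _, id := range nodeIDs {
			if present[id] == nil {
				present[id] = make(map[string]bool)
			}
			present[id][name] = true
		}
	}
	var ids []uint
	for _, r := range rows {
		// Reading from a nil inner map is safe in Go and yields false,
		// so nodes absent from the listing need no special case.
		if r.Op == "install" && present[r.NodeID][r.Backend] {
			ids = append(ids, r.ID)
		}
	}
	return ids
}
```

Only install rows are returned; an upgrade row for the same (node, backend) pair is left alone, which mirrors the rule stated in the ListBackends comment.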
// InstallBackend fans out installation through the pending-ops queue so
// non-healthy nodes get retried when they come back instead of being silently
// skipped. Reply success from the NATS round-trip deletes the queue row;
// reply.Success==false is treated as an error so the row stays for retry.
//
// When op.TargetNodeID is set, only that node is visited - the same allowlist
// path UpgradeBackend uses. Empty TargetNodeID preserves the original fan-out
// behavior so the periodic reconciler and /api/backends/install/:id keep
// working unchanged.
func (d *DistributedBackendManager) InstallBackend(ctx context.Context, op *galleryop.ManagementOp[gallery.GalleryBackend, any], progressCb galleryop.ProgressCallback) error {
galleriesJSON, _ := json.Marshal(op.Galleries)
backendName := op.GalleryElementName
result, err := d.enqueueAndDrainBackendOp(ctx, OpBackendInstall, backendName, galleriesJSON, nil, func(node BackendNode) error {
var targetNodeIDs map[string]bool
if op.TargetNodeID != "" {
targetNodeIDs = map[string]bool{op.TargetNodeID: true}
}
result, err := d.enqueueAndDrainBackendOp(ctx, op.ID, OpBackendInstall, backendName, galleriesJSON, targetNodeIDs, func(node BackendNode) error {
// onProgress fans each BackendInstallProgressEvent into two
// observers: the legacy single-bar progressCb (kept so callers
// that only consume the aggregate view keep working) and the
// per-node sink (so OpStatus.Nodes gets a "downloading" tick
// per file/percentage with node attribution). Defined inside the
// loop so each node captures its own node.Name into the closure.
onProgress := func(ev messaging.BackendInstallProgressEvent) {
if progressCb != nil {
progressCb(ev.FileName, ev.Current, ev.Total, ev.Percentage)
}
if d.progressSink != nil && op.ID != "" {
d.progressSink.UpdateNodeProgress(op.ID, ev.NodeID, galleryop.NodeProgress{
NodeID: ev.NodeID,
NodeName: node.Name,
Status: galleryop.NodeStatusDownloading,
FileName: ev.FileName,
Current: ev.Current,
Total: ev.Total,
Percentage: ev.Percentage,
Phase: ev.Phase,
})
}
}
// nil-callback shortcut: when there is nothing to deliver to,
// hand the adapter a nil onProgress so it skips the per-op NATS
// subscription. Matches the pre-Phase-4 bridgeProgressCb semantics.
var onProgressArg func(messaging.BackendInstallProgressEvent)
if progressCb != nil || d.progressSink != nil {
onProgressArg = onProgress
}
// Admin-driven backend install: not tied to a specific replica slot.
// Pass replica 0 - the worker's processKey is "backend#0" when no
// modelID is supplied, matching pre-PR4 behavior.
reply, err := d.adapter.InstallBackend(node.ID, backendName, "", string(galleriesJSON), op.ExternalURI, op.ExternalName, op.ExternalAlias, 0)
reply, err := d.adapter.InstallBackend(node.ID, backendName, "", string(galleriesJSON), op.ExternalURI, op.ExternalName, op.ExternalAlias, 0, op.ID, onProgressArg)
if err != nil {
return err
}
@@ -351,7 +506,19 @@ func (d *DistributedBackendManager) InstallBackend(ctx context.Context, op *gall
if err != nil {
return err
}
return result.Err()
if hardErr := result.Err(); hardErr != nil {
return hardErr
}
// No hard failures, but if at least one node reported running_on_worker,
// surface a wrapped ErrWorkerStillInstalling so galleryop can render a
// yellow in-progress state instead of green success. The reconciler
// will confirm the actual outcome on its next pass via backend.list.
for _, n := range result.Nodes {
if n.Status == galleryop.NodeStatusRunningOnWorker {
return fmt.Errorf("%w: %s", galleryop.ErrWorkerStillInstalling, summarizeRunningOnWorker(result.Nodes))
}
}
return nil
}
// UpgradeBackend uses a separate NATS subject (backend.upgrade) so the slow
@@ -382,7 +549,11 @@ func (d *DistributedBackendManager) UpgradeBackend(ctx context.Context, name str
targetNodeIDs[n.NodeID] = true
}
result, err := d.enqueueAndDrainBackendOp(ctx, OpBackendUpgrade, name, galleriesJSON, targetNodeIDs, func(node BackendNode) error {
// Empty opID: the caller (galleryop) doesn't thread an op ID into
// UpgradeBackend today, so we can't tag per-node sink writes with the
// right OpStatus key. Until the upgrade path takes a ManagementOp the
// way InstallBackend does, the sink stays no-op here.
result, err := d.enqueueAndDrainBackendOp(ctx, "", OpBackendUpgrade, name, galleriesJSON, targetNodeIDs, func(node BackendNode) error {
reply, err := d.adapter.UpgradeBackend(node.ID, name, string(galleriesJSON), "", "", "", 0)
if err != nil {
// Rolling-update fallback: an older worker doesn't know
@@ -407,7 +578,18 @@ func (d *DistributedBackendManager) UpgradeBackend(ctx context.Context, name str
if err != nil {
return err
}
return result.Err()
if hardErr := result.Err(); hardErr != nil {
return hardErr
}
// Same in-progress surfacing as InstallBackend: a long-running worker
// upgrade that timed out the NATS round-trip must not be reported as
// green success.
for _, n := range result.Nodes {
if n.Status == galleryop.NodeStatusRunningOnWorker {
return fmt.Errorf("%w: %s", galleryop.ErrWorkerStillInstalling, summarizeRunningOnWorker(result.Nodes))
}
}
return nil
}
// IsDistributed reports that installs from this manager fan out across the
@@ -433,3 +615,16 @@ func (d *DistributedBackendManager) CheckUpgrades(ctx context.Context) (map[stri
// it used to come from the empty frontend filesystem.
return gallery.CheckUpgradesAgainst(ctx, d.backendGalleries, d.systemState, installed)
}
// summarizeRunningOnWorker builds a short human-readable summary of which
// nodes are still installing in the background, for inclusion in the
// wrapped ErrWorkerStillInstalling error.
func summarizeRunningOnWorker(nodes []NodeOpStatus) string {
var names []string
for _, n := range nodes {
if n.Status == galleryop.NodeStatusRunningOnWorker {
names = append(names, n.NodeName)
}
}
return strings.Join(names, ", ")
}
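The soft-failure surfacing used by InstallBackend and UpgradeBackend relies on wrapping a sentinel with %w so callers can still match it with errors.Is. A minimal sketch of that pattern follows; errStillInstalling, nodeStatus, softInstallErr, isSoft, and the "running_on_worker" string are hypothetical names chosen for illustration, not the galleryop definitions.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errStillInstalling is a hypothetical stand-in for
// galleryop.ErrWorkerStillInstalling.
var errStillInstalling = errors.New("worker still installing")

// nodeStatus is a simplified, assumed version of NodeOpStatus.
type nodeStatus struct {
	Name   string
	Status string
}

// softInstallErr returns nil when no node is still installing; otherwise it
// wraps the sentinel with the comma-separated names of the busy nodes.
func softInstallErr(nodes []nodeStatus) error {
	var names []string
	for _, n := range nodes {
		if n.Status == "running_on_worker" { // assumed status value
			names = append(names, n.Name)
		}
	}
	if len(names) == 0 {
		return nil
	}
	return fmt.Errorf("%w: %s", errStillInstalling, strings.Join(names, ", "))
}

// isSoft reports whether err wraps the still-installing sentinel.
func isSoft(err error) bool {
	return errors.Is(err, errStillInstalling)
}
```

Because the sentinel is wrapped rather than formatted with %v, a caller can distinguish the in-progress state from a hard failure without parsing the message text.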


@@ -3,6 +3,7 @@ package nodes
import (
"context"
"encoding/json"
"errors"
"runtime"
"sync"
"time"
@@ -12,6 +13,7 @@ import (
. "github.com/onsi/gomega"
"gorm.io/gorm"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/gallery"
"github.com/mudler/LocalAI/core/services/galleryop"
"github.com/mudler/LocalAI/core/services/messaging"
@@ -22,11 +24,35 @@ import (
// (or error). Used so each fan-out request can simulate a different worker
// outcome without spinning up real NATS.
type scriptedMessagingClient struct {
mu sync.Mutex
replies map[string][]byte
errs map[string]error
calls []requestCall
matchedReplies map[string][]matchedReply
mu sync.Mutex
replies map[string][]byte
errs map[string]error
calls []requestCall
matchedReplies map[string][]matchedReply
publishes []progressPublishCall
scheduledProgressPublishes []scheduledProgressPublish
subscribes []string
}
// progressPublishCall records a single Publish invocation. The progress
// publisher tests assert on the sequence of BackendInstallProgressEvent
// values written to a per-op subject, so we capture both subject and the
// decoded event. Named to avoid clashing with the simpler `publishCall`
// already defined in unloader_test.go (which stores raw JSON bytes for
// non-progress assertions).
type progressPublishCall struct {
Subject string
Event messaging.BackendInstallProgressEvent
}
// scheduledProgressPublish queues a batch of BackendInstallProgressEvent
// values to be delivered the next time Subscribe is called with the matching
// subject. This lets master-side tests assert that the adapter installs its
// handler BEFORE publishing the install request, by scripting events to be
// delivered as soon as the subscription appears.
type scheduledProgressPublish struct {
subject string
events []messaging.BackendInstallProgressEvent
}
// matchedReply lets a test script a canned reply that only fires when the
@@ -98,10 +124,10 @@ func (s *scriptedMessagingClient) scriptReplyMatching(subject string, pred func(
})
}
func (s *scriptedMessagingClient) Request(subject string, data []byte, _ time.Duration) ([]byte, error) {
func (s *scriptedMessagingClient) Request(subject string, data []byte, timeout time.Duration) ([]byte, error) {
s.mu.Lock()
defer s.mu.Unlock()
s.calls = append(s.calls, requestCall{Subject: subject, Data: data})
s.calls = append(s.calls, requestCall{Subject: subject, Data: data, Timeout: timeout})
// Predicate-matched replies take precedence over flat scriptReply.
if matchers, ok := s.matchedReplies[subject]; ok {
@@ -135,8 +161,88 @@ func (s *scriptedMessagingClient) Request(subject string, data []byte, _ time.Du
return nil, &fakeNoRespondersErr{}
}
func (s *scriptedMessagingClient) Publish(_ string, _ any) error { return nil }
func (s *scriptedMessagingClient) Subscribe(_ string, _ func([]byte)) (messaging.Subscription, error) {
// Publish records each call so progress-publisher tests can assert on the
// stream of events written to a subject. The real messaging.Client JSON
// encodes the payload before sending, but our publisher hands a typed
// struct directly, so we handle both shapes.
func (s *scriptedMessagingClient) Publish(subject string, data any) error {
s.mu.Lock()
defer s.mu.Unlock()
switch ev := data.(type) {
case messaging.BackendInstallProgressEvent:
s.publishes = append(s.publishes, progressPublishCall{Subject: subject, Event: ev})
case []byte:
var e messaging.BackendInstallProgressEvent
_ = json.Unmarshal(ev, &e)
s.publishes = append(s.publishes, progressPublishCall{Subject: subject, Event: e})
}
return nil
}
// publishCalls returns every BackendInstallProgressEvent that was published
// to `subject`, in order. Lets tests assert on debounce behavior without
// depending on internal Publish timing.
func (s *scriptedMessagingClient) publishCalls(subject string) []messaging.BackendInstallProgressEvent {
s.mu.Lock()
defer s.mu.Unlock()
out := make([]messaging.BackendInstallProgressEvent, 0)
for _, c := range s.publishes {
if c.Subject != subject {
continue
}
out = append(out, c.Event)
}
return out
}
// scheduleProgressPublish queues a set of BackendInstallProgressEvent values
// to be delivered on the next Subscribe call matching the per-op progress
// subject. A short delay before delivery gives the subscriber time to install
// its message handler before the events arrive.
func (s *scriptedMessagingClient) scheduleProgressPublish(nodeID, opID string, events []messaging.BackendInstallProgressEvent) {
s.mu.Lock()
defer s.mu.Unlock()
s.scheduledProgressPublishes = append(s.scheduledProgressPublishes, scheduledProgressPublish{
subject: messaging.SubjectNodeBackendInstallProgress(nodeID, opID),
events: events,
})
}
// subscribeCalls returns the subjects on which Subscribe was invoked.
// Used to confirm the master skipped subscription when onProgress was nil.
func (s *scriptedMessagingClient) subscribeCalls() []string {
s.mu.Lock()
defer s.mu.Unlock()
out := make([]string, len(s.subscribes))
copy(out, s.subscribes)
return out
}
func (s *scriptedMessagingClient) Subscribe(subject string, handler func([]byte)) (messaging.Subscription, error) {
s.mu.Lock()
s.subscribes = append(s.subscribes, subject)
matched := []scheduledProgressPublish{}
remaining := s.scheduledProgressPublishes[:0]
for _, sp := range s.scheduledProgressPublishes {
if sp.subject == subject {
matched = append(matched, sp)
} else {
remaining = append(remaining, sp)
}
}
s.scheduledProgressPublishes = remaining
s.mu.Unlock()
go func() {
time.Sleep(20 * time.Millisecond)
for _, sp := range matched {
for _, ev := range sp.events {
raw, _ := json.Marshal(ev)
handler(raw)
}
}
}()
return &fakeSubscription{}, nil
}
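The Subscribe fake splits the scheduled publishes into matched and remaining entries while reusing the original backing array via the `[:0]` idiom. A small sketch of that idiom on plain strings, with partition as a hypothetical helper name:

```go
package main

// partition splits items into the entries equal to want and the rest.
// rest reuses the backing array of items, so items must not be read
// through its old slice header after the call.
func partition(items []string, want string) (matched, rest []string) {
	rest = items[:0]
	for _, it := range items {
		if it == want {
			matched = append(matched, it)
		} else {
			// The write index never overtakes the read index, so no
			// unread element is overwritten.
			rest = append(rest, it)
		}
	}
	return matched, rest
}
```

The in-place form avoids a second allocation for the surviving entries, which is why the caller assigns the result back to the original field.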
func (s *scriptedMessagingClient) QueueSubscribe(_ string, _ string, _ func([]byte)) (messaging.Subscription, error) {
@@ -151,8 +257,43 @@ func (s *scriptedMessagingClient) SubscribeReply(_ string, _ func([]byte, func([
func (s *scriptedMessagingClient) IsConnected() bool { return true }
func (s *scriptedMessagingClient) Close() {}
// recordingNodeCall captures a single UpdateNodeProgress invocation so
// per-node OpStatus tests can assert on the sequence of writes the
// DistributedBackendManager fans out into the sink.
type recordingNodeCall struct {
OpID string
NodeID string
Progress galleryop.NodeProgress
}
// recordingProgressSink is a test-only nodeProgressSink that just records
// every call. Used by the per-node OpStatus specs below to assert the
// manager wrote the expected terminal and downloading entries.
type recordingProgressSink struct {
mu sync.Mutex
calls []recordingNodeCall
}
func (r *recordingProgressSink) UpdateNodeProgress(opID, nodeID string, np galleryop.NodeProgress) {
r.mu.Lock()
defer r.mu.Unlock()
r.calls = append(r.calls, recordingNodeCall{OpID: opID, NodeID: nodeID, Progress: np})
}
func (r *recordingProgressSink) callsFor(opID, nodeID string) []galleryop.NodeProgress {
r.mu.Lock()
defer r.mu.Unlock()
out := []galleryop.NodeProgress{}
for _, c := range r.calls {
if c.OpID == opID && c.NodeID == nodeID {
out = append(out, c.Progress)
}
}
return out
}
// fakeNoRespondersErr is the unscripted-subject default. It matches
// nats.ErrNoResponders by string only - used when a test forgets to script
// a node so the failure is loud but doesn't tickle errors.Is(...) sentinel
// paths the test wasn't deliberately exercising. Tests that DO want the
// real sentinel (e.g. to drive the manager's NoResponders fallback) call
@@ -204,7 +345,7 @@ var _ = Describe("DistributedBackendManager", func() {
Expect(err).ToNot(HaveOccurred())
mc = newScriptedMessagingClient()
adapter = NewRemoteUnloaderAdapter(nil, mc)
adapter = NewRemoteUnloaderAdapter(nil, mc, 3*time.Minute, 15*time.Minute)
mgr = &DistributedBackendManager{
local: stubLocalBackendManager{},
adapter: adapter,
@@ -311,6 +452,304 @@ var _ = Describe("DistributedBackendManager", func() {
Expect(mgr.InstallBackend(ctx, op("vllm-development"), nil)).To(Succeed())
})
})
Context("when op.TargetNodeID is set to a healthy node", func() {
It("installs only on that node, leaving the others untouched", func() {
target := registerHealthyBackend("worker-target", "10.0.0.1:50051")
other := registerHealthyBackend("worker-other", "10.0.0.2:50051")
mc.scriptReply(messaging.SubjectNodeBackendInstall(target.ID),
messaging.BackendInstallReply{Success: true, Address: "10.0.0.1:50100"})
// No reply scripted for `other`: if InstallBackend fans out
// to it, the fakeNoRespondersErr default would surface and
// the test would fail.
targetedOp := &galleryop.ManagementOp[gallery.GalleryBackend, any]{
GalleryElementName: "llama-cpp",
TargetNodeID: target.ID,
}
Expect(mgr.InstallBackend(ctx, targetedOp, nil)).To(Succeed())
mc.mu.Lock()
defer mc.mu.Unlock()
Expect(mc.calls).To(HaveLen(1))
Expect(mc.calls[0].Subject).To(Equal(messaging.SubjectNodeBackendInstall(target.ID)))
Expect(mc.calls[0].Subject).ToNot(Equal(messaging.SubjectNodeBackendInstall(other.ID)))
})
})
Context("when op.TargetNodeID is set to a node that does not exist", func() {
It("returns nil without sending any NATS request", func() {
registerHealthyBackend("worker-a", "10.0.0.1:50051")
ghostOp := &galleryop.ManagementOp[gallery.GalleryBackend, any]{
GalleryElementName: "llama-cpp",
TargetNodeID: "this-id-does-not-exist",
}
Expect(mgr.InstallBackend(ctx, ghostOp, nil)).To(Succeed())
mc.mu.Lock()
defer mc.mu.Unlock()
Expect(mc.calls).To(BeEmpty())
})
})
Context("when InstallBackend times out on a worker", func() {
It("returns galleryop.ErrWorkerStillInstalling and keeps the queue row with NextRetryAt pushed out", func() {
n := registerHealthyBackend("slow", "10.0.0.1:50051")
// Script a NATS timeout on the install subject. The adapter
// wraps this into galleryop.ErrWorkerStillInstalling, which
// the manager should treat as a soft failure.
mc.scriptErr(messaging.SubjectNodeBackendInstall(n.ID), nats.ErrTimeout)
err := mgr.InstallBackend(ctx, op("vllm"), nil)
Expect(err).To(HaveOccurred())
Expect(errors.Is(err, galleryop.ErrWorkerStillInstalling)).To(BeTrue(),
"expected wrapped ErrWorkerStillInstalling, got %v", err)
rows, err := registry.ListPendingBackendOps(ctx)
Expect(err).ToNot(HaveOccurred())
Expect(rows).To(HaveLen(1))
Expect(rows[0].Backend).To(Equal("vllm"))
// The adapter is configured with a 3m install timeout in this
// suite (NewRemoteUnloaderAdapter above). NextRetryAt should
// be ~now+3m; a > now+2m bound is safe-but-tight enough to
// catch the buggy short default (30s exponential backoff).
Expect(rows[0].NextRetryAt).To(BeTemporally(">", time.Now().Add(2*time.Minute)),
"NextRetryAt should be pushed to ~now+installTimeout, not the short default")
})
})
Context("end-to-end: timeout then successful reconcile via backend.list", func() {
It("surfaces the install in ListBackends after the worker finishes", func() {
// Use the same node-registration helper the Task 5 test uses
// so the test fixture is identical to the prior context.
node := registerHealthyBackend("jetson", "10.0.0.2:50051")
// First install attempt: NATS times out. The adapter wraps
// this as galleryop.ErrWorkerStillInstalling and the manager
// keeps the pending_backend_ops row alive with NextRetryAt
// pushed out (asserted in the previous context).
mc.scriptErr(messaging.SubjectNodeBackendInstall(node.ID), nats.ErrTimeout)
err := mgr.InstallBackend(ctx, op("vllm"), nil)
Expect(err).To(HaveOccurred())
Expect(errors.Is(err, galleryop.ErrWorkerStillInstalling)).To(BeTrue(),
"expected wrapped ErrWorkerStillInstalling, got %v", err)
rows, listErr := registry.ListPendingBackendOps(ctx)
Expect(listErr).ToNot(HaveOccurred())
Expect(rows).To(HaveLen(1))
// The worker finished installing in the background. Script
// backend.list on the same scriptedMessagingClient so the
// manager's ListBackends fan-out reports the backend.
mc.scriptReply(messaging.SubjectNodeBackendList(node.ID), messaging.BackendListReply{
Backends: []messaging.NodeBackendInfo{{Name: "vllm"}},
})
backends, listErr := mgr.ListBackends()
Expect(listErr).ToNot(HaveOccurred())
Expect(backends).To(HaveKey("vllm"))
Expect(backends["vllm"].Nodes).To(HaveLen(1))
Expect(backends["vllm"].Nodes[0].NodeID).To(Equal(node.ID))
// Phase 1b shipped: ListBackends proactively clears install rows
// whose intent is now satisfied by backend.list confirmation. The
// operator UI clears immediately instead of waiting for the next
// reconciler tick after NextRetryAt.
rowsAfter, _ := registry.ListPendingBackendOps(ctx)
Expect(rowsAfter).To(BeEmpty(),
"install row should clear once backend.list confirms presence on the target node")
})
})
Context("ListBackends clears confirmed install rows", func() {
It("deletes the pending_backend_ops install row when the backend is reported installed on its target node", func() {
node := registerHealthyBackend("worker-a", "10.0.0.5:50051")
// Pre-stage: simulate an admin install that timed out at the NATS
// round-trip, leaving an install row in the queue.
mc.scriptErr(messaging.SubjectNodeBackendInstall(node.ID), nats.ErrTimeout)
err := mgr.InstallBackend(ctx, op("vllm"), nil)
Expect(err).To(HaveOccurred())
Expect(errors.Is(err, galleryop.ErrWorkerStillInstalling)).To(BeTrue())
rows, _ := registry.ListPendingBackendOps(ctx)
Expect(rows).To(HaveLen(1))
// Worker finishes installing in the background. backend.list now
// confirms presence; ListBackends should proactively clear the row.
mc.scriptReply(messaging.SubjectNodeBackendList(node.ID), messaging.BackendListReply{
Backends: []messaging.NodeBackendInfo{{Name: "vllm"}},
})
backends, listErr := mgr.ListBackends()
Expect(listErr).ToNot(HaveOccurred())
Expect(backends).To(HaveKey("vllm"))
rowsAfter, _ := registry.ListPendingBackendOps(ctx)
Expect(rowsAfter).To(BeEmpty(),
"ListBackends should clear install rows whose intent is now satisfied by backend.list")
})
It("does NOT clear an upgrade row even if the backend is reported installed", func() {
node := registerHealthyBackend("worker-b", "10.0.0.6:50051")
Expect(registry.UpsertPendingBackendOp(ctx, node.ID, "vllm", OpBackendUpgrade, []byte("[]"))).To(Succeed())
mc.scriptReply(messaging.SubjectNodeBackendList(node.ID), messaging.BackendListReply{
Backends: []messaging.NodeBackendInfo{{Name: "vllm"}},
})
_, listErr := mgr.ListBackends()
Expect(listErr).ToNot(HaveOccurred())
rowsAfter, _ := registry.ListPendingBackendOps(ctx)
Expect(rowsAfter).To(HaveLen(1), "upgrade rows must not be cleared by backend.list presence")
})
})
Context("InstallBackend streams progress events to the caller's progressCb", func() {
It("invokes progressCb once per worker-published progress event", func() {
node := registerHealthyBackend("worker-prog", "10.0.0.7:50051")
mc.scriptReply(messaging.SubjectNodeBackendInstall(node.ID), messaging.BackendInstallReply{Success: true, Address: "10.0.0.7:50051"})
mc.scheduleProgressPublish(node.ID, "op-prog-1", []messaging.BackendInstallProgressEvent{
{OpID: "op-prog-1", NodeID: node.ID, Backend: "vllm", FileName: "vllm.tar", Current: "100 MB", Total: "1 GB", Percentage: 10},
{OpID: "op-prog-1", NodeID: node.ID, Backend: "vllm", FileName: "vllm.tar", Current: "1 GB", Total: "1 GB", Percentage: 100},
})
type tick struct {
FileName, Current, Total string
Percentage float64
}
var (
pcCalls []tick
mu sync.Mutex
)
progressCb := func(file, current, total string, pct float64) {
mu.Lock()
defer mu.Unlock()
pcCalls = append(pcCalls, tick{file, current, total, pct})
}
opVal := op("vllm")
opVal.ID = "op-prog-1"
Expect(mgr.InstallBackend(ctx, opVal, progressCb)).To(Succeed())
Eventually(func() int {
mu.Lock()
defer mu.Unlock()
return len(pcCalls)
}, "1s").Should(Equal(2))
mu.Lock()
defer mu.Unlock()
// The adapter dispatches each progress event to its own goroutine
// (see unloader.go: `go onProgress(ev)`) so two events emitted back
// to back can land at the bridge in either order. Assert the set of
// percentages observed contains both ticks, rather than depending
// on goroutine scheduling for ordering.
pcts := []float64{pcCalls[0].Percentage, pcCalls[1].Percentage}
Expect(pcts).To(ConsistOf(10.0, 100.0))
})
})
Context("InstallBackend tolerates silent (pre-Phase-2) workers", func() {
It("completes successfully even when no progress events are ever published", func() {
node := registerHealthyBackend("worker-silent", "10.0.0.8:50051")
mc.scriptReply(messaging.SubjectNodeBackendInstall(node.ID), messaging.BackendInstallReply{Success: true, Address: "10.0.0.8:50051"})
// NO scheduleProgressPublish call - silent worker.
var ticks int
var mu sync.Mutex
progressCb := func(file, current, total string, pct float64) {
mu.Lock()
defer mu.Unlock()
ticks++
}
opVal := op("vllm")
opVal.ID = "op-silent-1"
Expect(mgr.InstallBackend(ctx, opVal, progressCb)).To(Succeed())
Consistently(func() int {
mu.Lock()
defer mu.Unlock()
return ticks
}, "200ms").Should(Equal(0))
})
})
Context("populates per-node OpStatus entries", func() {
var sink *recordingProgressSink
BeforeEach(func() {
// Reconstruct mgr with the recording sink so the new code
// path (per-node OpStatus writes) is exercised. The default
// mgr in the outer BeforeEach has progressSink=nil so the
// pre-existing specs keep verifying the no-sink behavior.
sink = &recordingProgressSink{}
appCfg := &config.ApplicationConfig{}
mgr = NewDistributedBackendManager(appCfg, nil, adapter, registry, sink)
// stubLocalBackendManager mirrors the production behaviour
// where the frontend node rarely has the backend installed
// locally - the NATS fan-out is what these specs verify.
mgr.local = stubLocalBackendManager{}
})
It("emits a success entry for each healthy node visited", func() {
node := registerHealthyBackend("worker-ok", "10.0.0.9:50051")
mc.scriptReply(messaging.SubjectNodeBackendInstall(node.ID),
messaging.BackendInstallReply{Success: true, Address: "10.0.0.9:50051"})
opVal := op("vllm")
opVal.ID = "op-node-success"
Expect(mgr.InstallBackend(ctx, opVal, nil)).To(Succeed())
calls := sink.callsFor("op-node-success", node.ID)
Expect(calls).ToNot(BeEmpty())
Expect(calls[len(calls)-1].Status).To(Equal(galleryop.NodeStatusSuccess))
Expect(calls[len(calls)-1].NodeName).To(Equal("worker-ok"))
})
It("emits a running_on_worker entry when NATS times out", func() {
node := registerHealthyBackend("worker-slow", "10.0.0.10:50051")
mc.scriptErr(messaging.SubjectNodeBackendInstall(node.ID), nats.ErrTimeout)
opVal := op("vllm")
opVal.ID = "op-node-slow"
// Soft failure: returns wrapped ErrWorkerStillInstalling.
_ = mgr.InstallBackend(ctx, opVal, nil)
calls := sink.callsFor("op-node-slow", node.ID)
Expect(calls).ToNot(BeEmpty())
Expect(calls[len(calls)-1].Status).To(Equal(galleryop.NodeStatusRunningOnWorker))
})
It("emits downloading entries from progress events", func() {
node := registerHealthyBackend("worker-dl", "10.0.0.11:50051")
mc.scriptReply(messaging.SubjectNodeBackendInstall(node.ID),
messaging.BackendInstallReply{Success: true})
mc.scheduleProgressPublish(node.ID, "op-node-dl", []messaging.BackendInstallProgressEvent{
{OpID: "op-node-dl", NodeID: node.ID, Backend: "vllm", FileName: "vllm.tar", Current: "1 GB", Total: "1 GB", Percentage: 100, Phase: messaging.PhaseDownloading},
})
opVal := op("vllm")
opVal.ID = "op-node-dl"
Expect(mgr.InstallBackend(ctx, opVal, nil)).To(Succeed())
Eventually(func() bool {
for _, np := range sink.callsFor("op-node-dl", node.ID) {
if np.Status == galleryop.NodeStatusDownloading && np.Percentage == 100.0 {
return true
}
}
return false
}, "1s").Should(BeTrue())
})
})
})
Describe("UpgradeBackend", func() {


@@ -0,0 +1,94 @@
package nodes
import (
"sync"
"time"
"golang.org/x/sync/singleflight"
)
// probeCache memoizes recent successful gRPC HealthCheck results for
// (nodeID, addr) tuples so SmartRouter.probeHealth doesn't pay a round-trip
// on every inference request.
//
// Why this exists: with per-request routing (see pkg/model/loader.go), every
// inference call goes through SmartRouter.Route, which probes the backend
// before returning a client. Many gRPC backends (notably llama.cpp's server)
// serialize HealthCheck against active Predict on a shared goroutine, so a
// burst of new requests can stall behind a single long-running stream —
// exactly the "queue stalling" symptom observed in distributed clusters.
//
// The background HealthMonitor (perModelHealthCheck) is still the cluster-wide
// source of truth that reaps actually-dead backends within ~45s; this cache
// only saves the per-request hot path from re-asking when nothing has changed.
//
// TTL matches healthCheckTTL in pkg/model/model.go so the single-process
// IsRecentlyHealthy path and this distributed-mode path share the same
// staleness budget.
type probeCache struct {
	ttl    time.Duration
	mu     sync.Mutex
	seen   map[string]time.Time // key → last successful probe
	flight singleflight.Group   // coalesces concurrent probes for the same key
}

// newProbeCache returns a probeCache with the given TTL. Zero TTL disables
// caching: every call to DoOrCached invokes the probe.
func newProbeCache(ttl time.Duration) *probeCache {
	return &probeCache{
		ttl:  ttl,
		seen: make(map[string]time.Time),
	}
}

// IsFresh reports whether key was successfully probed within TTL.
func (c *probeCache) IsFresh(key string) bool {
	if c.ttl <= 0 {
		return false
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	last, ok := c.seen[key]
	return ok && time.Since(last) < c.ttl
}

// markFresh records key as successfully probed at the current time.
func (c *probeCache) markFresh(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.seen[key] = time.Now()
}

// Invalidate drops any cached freshness for key. Used after a probe failure
// (or any other signal that the backend may not be alive) so the next call
// will re-probe instead of trusting stale state.
func (c *probeCache) Invalidate(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.seen, key)
}

// DoOrCached returns true if key is fresh; otherwise it runs probe (coalescing
// concurrent callers via singleflight) and caches a successful result. Failed
// probes are never cached: they drop any existing entry for key, so the next
// call re-probes instead of trusting stale state.
func (c *probeCache) DoOrCached(key string, probe func() bool) bool {
	if c.IsFresh(key) {
		return true
	}
	v, _, _ := c.flight.Do(key, func() (any, error) {
		// Double-check after potentially waiting: another caller in this
		// flight may have just populated the cache.
		if c.IsFresh(key) {
			return true, nil
		}
		ok := probe()
		if ok {
			c.markFresh(key)
		} else {
			c.Invalidate(key)
		}
		return ok, nil
	})
	return v.(bool)
}

Some files were not shown because too many files have changed in this diff.