When the LocalAI frontend deployment is scaled past one replica, the UI's
/api/operations poll round-robins between pods. Each pod kept the OpCache
(galleryID->jobID), OpStatus map, and the post-install in-memory caches
(ModelConfigLoader, UpgradeChecker) purely in-process. Reads never
consulted PostgreSQL or NATS even though writes already published to PG.
Symptoms:
- A user installing a model on replica A saw the operation card flicker
in and out as the load balancer alternated.
- The Models page re-fetched the whole gallery on every flicker because
useEffect([operations.length]) re-fires when the count changes.
- A chat completion that landed on replica B after the install completed
on replica A failed to find the new model: B's ModelConfigLoader still
held the old configuration because nothing told it to reload from disk.
- The UpgradeChecker 6-hour cache stayed stale on peer replicas after a
backend upgrade, so /api/backends/upgrades kept surfacing an upgrade
that had already shipped.
Mirror the jobs Dispatcher pattern for gallery ops:
- OpCache learns SetMessagingClient/SetGalleryStore + a Start(ctx) that
hydrates from PostgreSQL and subscribes to gallery.opcache.{start,end}.
Set/SetBackend now upsert cache_key + is_backend_op on the
gallery_operations row and broadcast OpCacheEvent so peers merge it in. The
hydrate path uses a new GalleryStore.ListActive() (status in {pending,
downloading, processing} and updated within 30 min).
- GalleryService.SubscribeBroadcasts wires a
SubjectGalleryProgressWildcard subscriber that merges each update into
the local statuses map via a new lock-light mergeStatus, plus a
SubjectGalleryCancelWildcard subscriber that
runs the locally-registered cancel func. Hydrate() restores active rows
from PostgreSQL on startup so a freshly-started replica is not
observably empty mid-install. CancelOperation tolerates the cancel func
living on a different replica and publishes anyway.
- modelHandler and backendHandler publish on the new
SubjectCacheInvalidateModels / SubjectCacheInvalidateBackends after
a successful install/delete/upgrade. SubscribeBroadcasts wires peers
to refresh: OnModelsChanged (re-runs LoadModelConfigsFromPath) and
OnBackendOpCompleted (re-triggers UpgradeChecker). The originating
replica reloads inline so it never enters the broadcast handler.
- OpStatus.Error (an error interface) flat-marshalled to "{}" over JSON,
so a failed install replicated to a peer arrived with a nil error and
the UI's failure banner never appeared. Add MarshalJSON/UnmarshalJSON
via an opStatusWire shim that round-trips Error as a string.
- UpdateStatus and CancelOperation now drop the mutex before publishing
to NATS or persisting to PostgreSQL. The wildcard subscriber's
mergeStatus loops back into the same service on the publishing replica
and would deadlock otherwise; this also prevents future PG round-trips
from stalling concurrent readers on every progress tick.
Tests cover the OpStatus error round-trip, OpCache propagation through a
shared in-memory bus, OpCache PostgreSQL hydration (active-only),
GalleryService progress + cancel broadcast, Nodes preservation across a
peer's bare progress tick, GalleryService hydration from PG, and the
two cache-invalidation broadcasts (models + backends). 44 specs total
in galleryop; routes/operations specs and jobs/agents suites still pass.
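
A shared in-memory bus of the kind these specs rely on, together with the
unlock-before-publish rule described above, might look like the sketch
below. Every name here (bus, service, newService, Get) is hypothetical. The
bus is assumed to deliver synchronously on the publisher's goroutine, which
makes the loop-back into the publishing replica deterministic. Real NATS
delivery is asynchronous.

```go
package main

import (
	"fmt"
	"sync"
)

// bus is a hypothetical synchronous in-memory pub/sub. Handlers run on
// the publisher's goroutine, so a publisher that still holds its own
// lock would block inside its own subscriber.
type bus struct {
	mu   sync.RWMutex
	subs []func(id string, progress float64)
}

func (b *bus) Subscribe(fn func(string, float64)) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.subs = append(b.subs, fn)
}

func (b *bus) Publish(id string, progress float64) {
	// Copy the handler list so handlers run without holding b.mu.
	b.mu.RLock()
	subs := append([]func(string, float64){}, b.subs...)
	b.mu.RUnlock()
	for _, fn := range subs {
		fn(id, progress)
	}
}

// service stands in for one replica's gallery service.
type service struct {
	mu       sync.Mutex
	statuses map[string]float64
	bus      *bus
}

func newService(b *bus) *service {
	s := &service{statuses: map[string]float64{}, bus: b}
	b.Subscribe(s.mergeStatus) // every replica, including the publisher, subscribes
	return s
}

// mergeStatus is what the subscriber calls. It takes s.mu itself.
func (s *service) mergeStatus(id string, progress float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.statuses[id] = progress
}

// UpdateStatus records the status locally, then releases the mutex
// before publishing. Publishing while holding s.mu would deadlock here,
// because the bus calls mergeStatus on this same service.
func (s *service) UpdateStatus(id string, progress float64) {
	s.mu.Lock()
	s.statuses[id] = progress
	s.mu.Unlock()
	s.bus.Publish(id, progress)
}

func (s *service) Get(id string) float64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.statuses[id]
}

func main() {
	b := &bus{}
	a, peer := newService(b), newService(b)
	a.UpdateStatus("op-1", 0.5)
	fmt.Println(a.Get("op-1"), peer.Get("op-1"))
}
```

Replacing the explicit Unlock in UpdateStatus with a deferred one reproduces
the deadlock in this sketch, since sync.Mutex is not reentrant.
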
Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>