LocalAI/core/services/messaging/subjects.go
LocalAI [bot] e5d7b84216 fix(distributed): split NATS backend.upgrade off install + dedup loads (#9717)
* feat(messaging): add backend.upgrade NATS subject + payload types

Splits the slow force-reinstall path off backend.install so it can run on
its own subscription goroutine, eliminating head-of-line blocking between
routine model loads and full gallery upgrades.

The wire-level Force flag on BackendInstallRequest is kept for one release as
the rolling-update fallback target; a doc note marks it deprecated.

Assisted-by: Claude:claude-sonnet-4-6
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed/worker): add per-backend mutex helper to backendSupervisor

Different backend names lock independently; the same backend serializes. This
is the synchronization primitive used by the upcoming concurrent install
handler — without it, wrapping the NATS callback in a goroutine would race
on the gallery directory when two requests target the same backend.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
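
A minimal sketch of the helper's likely shape (lockBackend is named in the
diff; the field names here are assumptions, not taken from it):

    // lockBackend returns the mutex dedicated to one backend name, creating
    // it on first use. Callers hold it across all gallery-directory work.
    func (s *backendSupervisor) lockBackend(name string) *sync.Mutex {
        s.mu.Lock()
        defer s.mu.Unlock()
        if s.backendLocks == nil {
            s.backendLocks = map[string]*sync.Mutex{}
        }
        m, ok := s.backendLocks[name]
        if !ok {
            m = &sync.Mutex{}
            s.backendLocks[name] = m
        }
        return m
    }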

* fix(distributed/worker): run backend.install handler in a goroutine

NATS subscriptions deliver messages serially on a single per-subscription
goroutine. With a synchronous install handler, a multi-minute gallery
download would head-of-line-block every other install request to the
same worker — manifesting upstream as a 5-minute "nats: timeout" on
unrelated routine model loads.

The body now runs in its own goroutine, with a per-backend mutex
(lockBackend) protecting the gallery directory from concurrent operations
on the same backend. Different backend names install in parallel.

Backward-compat: req.Force=true is still honored here, so an older master
that hasn't been updated to send on backend.upgrade keeps working.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
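
The shape of the change, roughly (subscription wiring simplified; only
lockBackend and the payload type are names from the diff):

    handler := func(msg *nats.Msg) {
        go func() {
            var req messaging.BackendInstallRequest
            if err := json.Unmarshal(msg.Data, &req); err != nil {
                return
            }
            mu := s.lockBackend(req.Backend) // same backend serializes
            mu.Lock()
            defer mu.Unlock()
            // ... gallery install (honoring req.Force), process start, JSON reply ...
        }()
    }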

* feat(distributed/worker): subscribe to backend.upgrade as a separate path

Slow force-reinstall now lives on its own NATS subscription, so a
multi-minute gallery pull cannot head-of-line-block the routine
backend.install handler on the same worker. Same per-backend mutex
guards both — concurrent install + upgrade for the same backend
serialize at the gallery directory; different backends are independent.

upgradeBackend stops every live process for the backend, force-installs
from gallery, and re-registers. It does not start a new process — the
next backend.install will spawn one with the freshly-pulled binary.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
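
Roughly (upgradeBackend and lockBackend are named in the diff; the other
method names below are stand-ins):

    func (s *backendSupervisor) upgradeBackend(req messaging.BackendUpgradeRequest) error {
        mu := s.lockBackend(req.Backend)
        mu.Lock()
        defer mu.Unlock()
        s.stopAllProcesses(req.Backend) // every live replica of this backend
        if err := s.forceInstallFromGallery(req); err != nil {
            return err
        }
        return s.reregister(req.Backend) // no new process is started here
    }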

* feat(distributed): add UpgradeBackend on NodeCommandSender; drop Force from InstallBackend

Master now sends to backend.upgrade for force-reinstall, with a
nats.ErrNoResponders fallback to the legacy backend.install Force=true
path so a rolling update with a new master + an old worker still
converges. The Force parameter leaves the public Go API surface
entirely — only the internal fallback sets it on the wire.

InstallBackend timeout drops 5min -> 3min (most replies are sub-second
since the worker short-circuits on already-running or already-installed).
UpgradeBackend timeout is 15min, sized for real-world Jetson-on-WiFi
gallery pulls.

Updates the admin install HTTP endpoint
(core/http/endpoints/localai/nodes.go) to the new signature too.

router_test.go's fakeUnloader does not yet implement the new interface
shape; Task 3.2 will catch it up before the next package-level test run.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
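
The fallback logic, sketched (the request helper is a stand-in;
nats.ErrNoResponders is the real nats.go sentinel):

    reply, err := m.request(ctx, messaging.SubjectNodeBackendUpgrade(nodeID), upgradeReq, 15*time.Minute)
    if errors.Is(err, nats.ErrNoResponders) {
        // Old worker: retry on the legacy install subject with Force on the wire.
        legacy := messaging.BackendInstallRequest{Backend: backend, Force: true}
        reply, err = m.request(ctx, messaging.SubjectNodeBackendInstall(nodeID), legacy, 15*time.Minute)
    }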

* test(distributed): update fakeUnloader for new NodeCommandSender shape

InstallBackend lost its force bool param (Force is no longer part of the public
Go API — only the internal upgrade-fallback path sets it on the wire), and
UpgradeBackend is a new method. The fake records both call slices and provides
an installHook concurrency seam for upcoming singleflight tests.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(distributed): cover UpgradeBackend's new subject + rolling-update fallback

Task 3.1 changed the master to publish UpgradeBackend on the new
backend.upgrade subject; the existing UpgradeBackend tests scripted the old
install subject, so all three began failing as expected. This updates them to
script SubjectNodeBackendUpgrade with BackendUpgradeReply.

Adds two new specs for the rolling-update fallback:
  - ErrNoResponders on backend.upgrade triggers a backend.install
    Force=true retry on the same node.
  - Non-NoResponders errors propagate to the caller unchanged.

scriptedMessagingClient gains scriptNoResponders (real nats sentinel) and
scriptReplyMatching (predicate-matched canned reply, used to assert that
the fallback path actually sets Force=true on the install retry).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(distributed): coalesce concurrent identical backend.install via singleflight

Six simultaneous chat completions for the same not-yet-loaded model were
observed firing six independent NATS install requests, each serializing
through the worker's per-subscription goroutine and amplifying queue
depth. SmartRouter now wraps the NATS round-trip in a singleflight.Group
keyed by (nodeID, backend, modelID, replica): N concurrent identical
loads share one round-trip and one reply.

Distinct (modelID, replica) keys still fire independent calls, so
multi-replica scaling and multi-model fan-out are unaffected.

fakeUnloader gains a sync.Mutex around its recording slices to keep
concurrent test goroutines race-clean.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
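
The coalescing step, sketched (field names assumed; singleflight is
golang.org/x/sync/singleflight):

    key := fmt.Sprintf("%s|%s|%s|%d", nodeID, backend, modelID, replica)
    v, err, _ := r.installGroup.Do(key, func() (interface{}, error) {
        // One NATS round-trip; concurrent callers with the same key share it.
        return r.sender.InstallBackend(ctx, nodeID, backend, modelID, replica)
    })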

* test(e2e/distributed): drop force arg from InstallBackend test calls

Two e2e test call sites still passed the trailing force bool that was
removed from RemoteUnloaderAdapter.InstallBackend in 9bde76d7. Caught
by golangci-lint typecheck on the upgrade-split branch (master CI was
already green because these tests don't run in the standard test path).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(distributed): extract worker business logic to core/services/worker

core/cli/worker.go grew to 1212 lines after the backend.upgrade split.
The CLI package was carrying backendSupervisor, NATS lifecycle handlers,
gallery install/upgrade orchestration, S3 file staging, and registration
helpers — all distributed-worker business logic that doesn't belong in
the cobra surface.

Move it to a new core/services/worker package, mirroring the existing
core/services/{nodes,messaging,galleryop} pattern. core/cli/worker.go
shrinks to ~19 lines: a kong-tagged shim that embeds worker.Config and
delegates Run.

No behavior change. All symbols stay unexported except Config and Run.
The three worker-specific tests (addr/replica/concurrency) move with
the code via git mv so history follows them.

Files split as:
  worker.go        - Run entry point
  config.go        - Config struct (kong tags retained, kong not imported)
  supervisor.go    - backendProcess, backendSupervisor, process lifecycle
  install.go       - installBackend, upgradeBackend, findBackend, lockBackend
  lifecycle.go     - subscribeLifecycleEvents (verbatim, decomposition is
                     a follow-up commit)
  file_staging.go  - subscribeFileStaging, isPathAllowed
  registration.go  - advertiseAddr, registrationBody, heartbeatBody, etc.
  reply.go         - replyJSON
  process_helpers.go - readLastLinesFromFile

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
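
The resulting shim, approximately (the embed-and-delegate shape is from this
commit; the type name and Run signature are assumptions):

    package cli

    import "github.com/mudler/LocalAI/core/services/worker"

    // Worker is the kong-visible command; behavior lives in services/worker.
    type Worker struct {
        worker.Config `embed:""`
    }

    func (w *Worker) Run() error {
        return worker.Run(w.Config)
    }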

* refactor(distributed/worker): decompose subscribeLifecycleEvents into per-event handlers

The 226-line subscribeLifecycleEvents method packed eight NATS subscriptions
inline. Each handler had accumulated doc comments interleaved with subscription
plumbing, making it hard to read any one handler without scrolling past the
others. Extract each handler into its own method on *backendSupervisor; the
subscriber becomes a thin 8-line dispatcher.

No behavior change: each method body is byte-equivalent to its corresponding
inline goroutine + handler. Doc comments that were attached to the inline
SubscribeReply calls migrate to the new method godocs.

Adding the next NATS subject is now a 2-line patch to the dispatcher plus
one new method, instead of grafting onto a monolith.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
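
After the extraction the subscriber reads roughly like this (SubscribeReply is
named above; the client field and handler method names are stand-ins):

    func (s *backendSupervisor) subscribeLifecycleEvents() error {
        // Thin dispatcher: one line per subject, bodies in their own methods.
        return errors.Join(
            s.client.SubscribeReply(messaging.SubjectNodeBackendInstall(s.nodeID), s.handleBackendInstall),
            s.client.SubscribeReply(messaging.SubjectNodeBackendUpgrade(s.nodeID), s.handleBackendUpgrade),
            s.client.SubscribeReply(messaging.SubjectNodeBackendStop(s.nodeID), s.handleBackendStop),
            // ... five more subjects ...
        )
    }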

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-08 16:24:54 +02:00


package messaging

import "strings"

// sanitizeSubjectToken replaces NATS-reserved characters in a subject token.
// NATS uses '.' as hierarchy delimiter and '*'/'>' as wildcards.
func sanitizeSubjectToken(s string) string {
	r := strings.NewReplacer(".", "-", "*", "-", ">", "-", " ", "-", "\t", "-", "\n", "-")
	return r.Replace(s)
}
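
// Example (illustrative): sanitizeSubjectToken("gpt-4.1") returns "gpt-4-1",
// so caller-supplied names cannot add hierarchy levels or act as wildcards.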

// NATS subject constants for the distributed architecture.
// Following the notetaker pattern: <entity>.<action>

// Job Distribution (Queue Groups — load-balanced, one consumer gets each message)
const (
	SubjectJobsNew      = "jobs.new"
	SubjectMCPCIJobsNew = "jobs.mcp-ci.new"
	SubjectAgentExecute = "agent.execute"
	QueueWorkers        = "workers"
)

// Status Updates (Pub/Sub — all subscribers get every message, for SSE bridging)
// These use parameterized subjects: e.g. SubjectAgentEvents("myagent", "user1")
const (
	subjectAgentEventsPrefix = "agent."
	subjectJobProgressPrefix = "jobs."
	subjectFineTunePrefix    = "finetune."
	subjectGalleryPrefix     = "gallery."
)

// SubjectAgentEvents returns the NATS subject for agent SSE events.
func SubjectAgentEvents(agentName, userID string) string {
	if userID == "" {
		userID = "anonymous"
	}
	return subjectAgentEventsPrefix + sanitizeSubjectToken(agentName) + ".events." + sanitizeSubjectToken(userID)
}

// SubjectJobProgress returns the NATS subject for job progress updates.
func SubjectJobProgress(jobID string) string {
	return subjectJobProgressPrefix + sanitizeSubjectToken(jobID) + ".progress"
}

// SubjectJobResult returns the NATS subject for the final job result (terminal state).
func SubjectJobResult(jobID string) string {
	return subjectJobProgressPrefix + sanitizeSubjectToken(jobID) + ".result"
}

// MCP Tool Execution (Request-Reply via NATS — load-balanced across agent workers)
const (
	SubjectMCPToolExecute = "mcp.tools.execute"
	SubjectMCPDiscovery   = "mcp.discovery"
	QueueAgentWorkers     = "agent-workers"
)

// SubjectFineTuneProgress returns the NATS subject for fine-tune progress.
func SubjectFineTuneProgress(jobID string) string {
	return subjectFineTunePrefix + sanitizeSubjectToken(jobID) + ".progress"
}

// SubjectGalleryProgress returns the NATS subject for gallery download progress.
func SubjectGalleryProgress(opID string) string {
	return subjectGalleryPrefix + sanitizeSubjectToken(opID) + ".progress"
}

// Control Signals (Pub/Sub — targeted cancellation)
const (
	subjectJobCancelPrefix      = "jobs."
	subjectAgentCancelPrefix    = "agent."
	subjectFineTuneCancelPrefix = "finetune."
	subjectGalleryCancelPrefix  = "gallery."
)

// Wildcard subjects for NATS subscriptions that match all IDs.
const (
	SubjectJobCancelWildcard       = "jobs.*.cancel"
	SubjectJobResultWildcard       = "jobs.*.result"
	SubjectJobProgressWildcard     = "jobs.*.progress"
	SubjectAgentCancelWildcard     = "agent.*.cancel"
	SubjectGalleryCancelWildcard   = "gallery.*.cancel"
	SubjectGalleryProgressWildcard = "gallery.*.progress"
)
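
// Illustrative only: an SSE bridge can watch every job's progress with one
// wildcard subscription (assumes an existing *nats.Conn named nc):
//
//	_, err := nc.Subscribe(SubjectJobProgressWildcard, func(m *nats.Msg) {
//		// m.Subject is "jobs.<id>.progress"; fan the payload out to SSE here.
//	})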

// SubjectJobCancel returns the NATS subject to cancel a running job.
func SubjectJobCancel(jobID string) string {
	return subjectJobCancelPrefix + sanitizeSubjectToken(jobID) + ".cancel"
}

// SubjectAgentCancel returns the NATS subject to cancel agent execution.
func SubjectAgentCancel(agentID string) string {
	return subjectAgentCancelPrefix + sanitizeSubjectToken(agentID) + ".cancel"
}

// SubjectFineTuneCancel returns the NATS subject to stop fine-tuning.
func SubjectFineTuneCancel(jobID string) string {
	return subjectFineTuneCancelPrefix + sanitizeSubjectToken(jobID) + ".cancel"
}

// SubjectGalleryCancel returns the NATS subject to cancel a gallery download.
func SubjectGalleryCancel(opID string) string {
	return subjectGalleryCancelPrefix + sanitizeSubjectToken(opID) + ".cancel"
}

// Node Backend Lifecycle (Pub/Sub — targeted to specific nodes)
//
// These subjects control the backend *process* lifecycle on a serve-backend node,
// mirroring how the local ModelLoader uses startProcess() / deleteProcess().
//
// Model loading (LoadModel gRPC) is done via direct gRPC calls to the node's
// address — no NATS needed for that, same as local mode.
const (
	subjectNodePrefix = "nodes."
)

// SubjectNodeBackendInstall tells a worker node to install a backend and start its gRPC process.
// Uses NATS request-reply: the SmartRouter sends the request, the worker installs
// the backend from gallery (if not already installed), starts the gRPC process,
// and replies when ready.
func SubjectNodeBackendInstall(nodeID string) string {
	return subjectNodePrefix + sanitizeSubjectToken(nodeID) + ".backend.install"
}

// BackendInstallRequest is the payload for a backend.install NATS request.
type BackendInstallRequest struct {
	Backend          string `json:"backend"`
	ModelID          string `json:"model_id,omitempty"`
	BackendGalleries string `json:"backend_galleries,omitempty"`
	// URI is set for external installs (OCI image, URL, or path). When non-empty
	// the worker routes to InstallExternalBackend instead of the gallery lookup.
	URI   string `json:"uri,omitempty"`
	Name  string `json:"name,omitempty"`
	Alias string `json:"alias,omitempty"`
	// ReplicaIndex selects which slot on the worker this load occupies, so two
	// concurrent backend.install requests for the same model land on distinct
	// gRPC processes and ports. Workers older than this field treat it as 0
	// (single-replica behavior — no collision because the controller never
	// asks for replica > 0 on a node whose MaxReplicasPerModel is 1).
	ReplicaIndex int32 `json:"replica_index,omitempty"`
	// Force is retained on the wire only for backward compatibility with
	// pre-2026-05-08 masters that did not know about backend.upgrade. New
	// callers MUST send to SubjectNodeBackendUpgrade instead. Workers continue
	// to honor Force=true here so a rolling update with new master + old
	// worker still works (the master's install fallback path also uses this
	// when backend.upgrade returns nats.ErrNoResponders).
	Force bool `json:"force,omitempty"`
}

// BackendInstallReply is the response from a backend.install NATS request.
type BackendInstallReply struct {
	Success bool   `json:"success"`
	Address string `json:"address,omitempty"` // gRPC address of the backend process (host:port)
	Error   string `json:"error,omitempty"`
}
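
// Illustrative only: a master-side install round trip using the types above
// (assumes an existing *nats.Conn named nc; values are placeholders):
//
//	body, _ := json.Marshal(BackendInstallRequest{Backend: "llama-cpp", ModelID: "some-model"})
//	msg, err := nc.Request(SubjectNodeBackendInstall("node-1"), body, 3*time.Minute)
//	if err == nil {
//		var reply BackendInstallReply
//		_ = json.Unmarshal(msg.Data, &reply) // reply.Address is the gRPC host:port
//	}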

// SubjectNodeBackendUpgrade tells a worker node to stop every running process
// for a backend and force-reinstall it from the gallery. It does not restart
// the processes — the next backend.install spawns one with the fresh binary.
// Uses NATS request-reply with a long deadline (gallery image pulls can take
// many minutes on slow links). Routine model loads use SubjectNodeBackendInstall
// instead — this subject exists so the slow path doesn't head-of-line-block
// the fast one through a shared subscription goroutine.
func SubjectNodeBackendUpgrade(nodeID string) string {
	return subjectNodePrefix + sanitizeSubjectToken(nodeID) + ".backend.upgrade"
}

// BackendUpgradeRequest is the payload for a backend.upgrade NATS request.
// It is intentionally a strict subset of BackendInstallRequest — there is no
// Force field because the upgrade subject IS the force semantics; no ModelID
// because upgrade is backend-scoped (it stops every replica using the binary
// before re-installing). Per-replica restart happens on the next routine load.
type BackendUpgradeRequest struct {
	Backend          string `json:"backend"`
	BackendGalleries string `json:"backend_galleries,omitempty"`
	URI              string `json:"uri,omitempty"`
	Name             string `json:"name,omitempty"`
	Alias            string `json:"alias,omitempty"`
	// ReplicaIndex is informational — upgrade stops all replicas regardless,
	// but the field lets future per-replica metadata (e.g. progress reporting
	// scoped to a slot) ride the same wire without a v3 type.
	ReplicaIndex int32 `json:"replica_index,omitempty"`
}

// BackendUpgradeReply mirrors BackendInstallReply minus Address — upgrade does
// not start a process, so there is no port to advertise. The subsequent
// routine load will re-bind via backend.install and learn the new address.
type BackendUpgradeReply struct {
	Success bool   `json:"success"`
	Error   string `json:"error,omitempty"`
}

// SubjectNodeBackendList queries a worker node for its installed backends.
// Uses NATS request-reply.
func SubjectNodeBackendList(nodeID string) string {
	return subjectNodePrefix + sanitizeSubjectToken(nodeID) + ".backend.list"
}

// BackendListRequest is the payload for a backend.list NATS request.
type BackendListRequest struct{}

// BackendListReply is the response from a backend.list NATS request.
type BackendListReply struct {
	Backends []NodeBackendInfo `json:"backends"`
	Error    string            `json:"error,omitempty"`
}

// NodeBackendInfo describes a backend installed on a worker node.
type NodeBackendInfo struct {
	Name        string `json:"name"`
	IsSystem    bool   `json:"is_system"`
	IsMeta      bool   `json:"is_meta"`
	InstalledAt string `json:"installed_at,omitempty"`
	GalleryURL  string `json:"gallery_url,omitempty"`
	// Version, URI and Digest enable cluster-wide upgrade detection —
	// without them, the frontend cannot tell whether the installed OCI
	// image matches the gallery entry, and upgrades silently never surface.
	Version string `json:"version,omitempty"`
	URI     string `json:"uri,omitempty"`
	Digest  string `json:"digest,omitempty"`
}

// SubjectNodeBackendStop tells a worker node to stop its gRPC backend process.
// Equivalent to the local deleteProcess(). The node will:
//  1. Best-effort Free() via gRPC
//  2. Kill the backend process
//
// The process can be started again by a subsequent backend.install request.
func SubjectNodeBackendStop(nodeID string) string {
	return subjectNodePrefix + sanitizeSubjectToken(nodeID) + ".backend.stop"
}

// SubjectNodeBackendDelete tells a worker node to delete a backend (stop + remove files).
// Uses NATS request-reply.
func SubjectNodeBackendDelete(nodeID string) string {
	return subjectNodePrefix + sanitizeSubjectToken(nodeID) + ".backend.delete"
}

// BackendDeleteRequest is the payload for a backend.delete NATS request.
type BackendDeleteRequest struct {
	Backend string `json:"backend"`
}

// BackendDeleteReply is the response from a backend.delete NATS request.
type BackendDeleteReply struct {
	Success bool   `json:"success"`
	Error   string `json:"error,omitempty"`
}

// SubjectNodeModelUnload tells a worker node to unload a model (gRPC Free) without killing the backend.
// Uses NATS request-reply.
func SubjectNodeModelUnload(nodeID string) string {
	return subjectNodePrefix + sanitizeSubjectToken(nodeID) + ".model.unload"
}

// ModelUnloadRequest is the payload for a model.unload NATS request.
type ModelUnloadRequest struct {
	ModelName string `json:"model_name"`
	Address   string `json:"address,omitempty"` // gRPC address of the backend process to unload from
}

// ModelUnloadReply is the response from a model.unload NATS request.
type ModelUnloadReply struct {
	Success bool   `json:"success"`
	Error   string `json:"error,omitempty"`
}

// SubjectNodeModelDelete tells a worker node to delete model files from disk.
// Uses NATS request-reply.
func SubjectNodeModelDelete(nodeID string) string {
	return subjectNodePrefix + sanitizeSubjectToken(nodeID) + ".model.delete"
}

// ModelDeleteRequest is the payload for a model.delete NATS request.
type ModelDeleteRequest struct {
	ModelName string `json:"model_name"`
}

// ModelDeleteReply is the response from a model.delete NATS request.
type ModelDeleteReply struct {
	Success bool   `json:"success"`
	Error   string `json:"error,omitempty"`
}

// SubjectNodeStop tells a serve-backend node to shut down entirely
// (deregister + exit). The node will not restart the backend process.
func SubjectNodeStop(nodeID string) string {
	return subjectNodePrefix + sanitizeSubjectToken(nodeID) + ".stop"
}

// File Staging (Request-Reply — targeted to specific nodes)
// These subjects use request-reply for synchronous file operations.

// SubjectNodeFilesEnsure tells a serve-backend node to download an S3 key to its local cache.
// Reply: {local_path, error}
func SubjectNodeFilesEnsure(nodeID string) string {
	return subjectNodePrefix + sanitizeSubjectToken(nodeID) + ".files.ensure"
}

// SubjectNodeFilesStage tells a serve-backend node to upload a local file to S3.
// Reply: {key, error}
func SubjectNodeFilesStage(nodeID string) string {
	return subjectNodePrefix + sanitizeSubjectToken(nodeID) + ".files.stage"
}

// SubjectNodeFilesTemp tells a serve-backend node to allocate a temp file.
// Reply: {local_path, error}
func SubjectNodeFilesTemp(nodeID string) string {
	return subjectNodePrefix + sanitizeSubjectToken(nodeID) + ".files.temp"
}

// SubjectNodeFilesListDir tells a serve-backend node to list files in a directory.
// Reply: {files: [...], error}
func SubjectNodeFilesListDir(nodeID string) string {
	return subjectNodePrefix + sanitizeSubjectToken(nodeID) + ".files.listdir"
}

// Cache Invalidation (Pub/Sub — broadcast to all instances)
const (
	SubjectCacheInvalidateSkills = "cache.invalidate.skills"
)

// SubjectCacheInvalidateCollection returns the NATS subject for collection cache invalidation.
func SubjectCacheInvalidateCollection(name string) string {
	return "cache.invalidate.collections." + sanitizeSubjectToken(name)
}