mirror of
https://github.com/mudler/LocalAI.git
synced 2026-05-29 19:19:19 -04:00
* feat(distributed): add configurable NATS backend install/upgrade timeouts

  Adds BackendInstallTimeout and BackendUpgradeTimeout to DistributedConfig with 15m defaults, following the existing MCPToolTimeout / WorkerWaitTimeout pattern. These will replace the hardcoded literals in RemoteUnloaderAdapter so admin-driven backend installs across the cluster survive long OCI image pulls that previously timed out at 3m.

* style(distributed): gofmt alignment after timeout fields

  Re-aligns the Validate() negative-duration map and the Default* const block so the new BackendInstall/UpgradeTimeout entries do not leave the surrounding columns mis-padded.

* feat(cli): surface LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT and _UPGRADE_TIMEOUT

  Parses the two new env vars on the run CLI and threads them through the existing AppOption builder so DistributedConfig picks them up. Invalid duration strings now fail loudly at startup rather than silently falling back to the default.

* feat(distributed): inject NATS install/upgrade timeouts into RemoteUnloaderAdapter

  Removes the hardcoded 3m / 15m literals from RemoteUnloaderAdapter and threads in DistributedConfig.BackendInstallTimeoutOrDefault() and BackendUpgradeTimeoutOrDefault() at construction. Install now defaults to 15m (was 3m); cold OCI image pulls on Jetson Wi-Fi routinely blew past the old ceiling. Scripted messaging client captures the timeout so tests can assert the configured value actually reaches the NATS request.

* feat(distributed): introduce galleryop.ErrWorkerStillInstalling sentinel

  When the NATS request-reply for backend.install (or .upgrade) times out the worker is almost always still pulling the OCI image. Wrap the timeout in a typed sentinel so the manager above can distinguish "worker hung" from "worker still working" and leave the pending_backend_ops row in place for the reconciler to confirm via backend.list.

* feat(distributed): treat NATS install timeout as in-progress, not failure

  When a worker times out replying to backend.install but the install is still running on the worker, enqueueAndDrainBackendOp now reports a running_on_worker status and pushes NextRetryAt out by the install timeout so the reconciler does not immediately re-fire another install while the worker is still pulling the image. The pending_backend_ops row stays in place for the next reconciler pass to confirm via backend.list. InstallBackend wraps the result in galleryop.ErrWorkerStillInstalling so callers can branch (galleryop renders yellow in-progress instead of red error). UpgradeBackend uses the same wrap. Adds RemoteUnloaderAdapter.InstallTimeout() so the manager can push NextRetryAt by the configured timeout without reaching into a private field, and NodeRegistry.RecordPendingBackendOpInFlight as the soft cousin of RecordPendingBackendOpFailure. Also includes incidental gofmt-driven struct-field alignment in registry.go on lines unrelated to the change (touched files are re-formatted to canonical form per project policy).

* fix(distributed): don't increment Attempts on in-flight install timeout

  An in-flight timeout (worker still pulling the OCI image) is not a failed attempt, it's a delayed one. Incrementing Attempts let genuinely-progressing slow installs (e.g. 30 GB CUDA images on Wi-Fi) trip the reconciler's maxPendingBackendOpAttempts cap and dead-letter the queue row while the worker was still legitimately working. RecordPendingBackendOpInFlight now only updates LastError and NextRetryAt. Also documents "running_on_worker" in the NodeOpStatus.Status enum comment so Task 6 implementers see the full surface.

* feat(galleryop): surface ErrWorkerStillInstalling as non-error OpStatus

  When the distributed backend manager returns an error that wraps ErrWorkerStillInstalling, backendHandler now completes the op with a "still installing in background" message rather than marking it as a red failure. Admin UI sees a yellow in-progress state; reconciler confirms completion on its next pass.

* test(distributed): end-to-end install-timeout-then-reconcile

  Wires Task 1-6 end-to-end so any seam mismatch surfaces in CI rather than during a real cluster install. NATS times out, the queue row stays alive with running_on_worker status, the worker eventually reports the backend installed via backend.list, the manager surfaces it via ListBackends.

* docs(distributed): document LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT / _UPGRADE_TIMEOUT

  Add the two new operator-tunable env vars to the Frontend Configuration table in the distributed-mode docs. Explains the 15m default, when to raise it (slow links pulling multi-GB OCI images), and the new "still installing in background" admin-UI state when the round-trip times out but the worker is still working.

* feat(distributed): clear pending install rows when backend.list confirms

  DistributedBackendManager.ListBackends now proactively clears pending_backend_ops install rows whose (nodeID, backend) is reported installed by backend.list. Operator UI updates immediately instead of waiting up to installTimeout (default 15m) for the next reconciler tick after NextRetryAt. Only install rows are cleared; upgrade and delete intents are not satisfied by presence in backend.list and continue to drain through their normal reconciler paths.

* feat(messaging): add BackendInstallProgressEvent wire type and subject

  New NATS subject nodes.<nodeID>.backend.install.<opID>.progress lets the worker publish transient progress events (file, current/total bytes, percentage, phase) while a long-running install pulls its OCI image. BackendInstallRequest gains an optional OpID field so the worker knows which subject to publish on. Transient pub/sub (not JetStream): the install reply remains ground truth for success/failure; dropped progress events are tolerable.

* style(messaging): drop em-dash from BackendInstallProgress test comment

  Per project convention (no em-dashes anywhere). Comment substance is unchanged.

* feat(distributed): worker publishes debounced install progress over NATS

  When BackendInstallRequest.OpID is set, the worker's backend.install handler wires a debounced publisher (250ms window) into the gallery download callback. Each tick becomes a BackendInstallProgressEvent on nodes.<nodeID>.backend.install.<opID>.progress; the publisher always emits a final event on Flush so the UI sees the terminal percentage. Old masters that do not set OpID continue to run silent installs: no behavior change for them. Lock ordering: the publisher releases its mutex before calling messaging.Publish so a slow network never stalls the install loop.

* feat(distributed): RemoteUnloaderAdapter subscribes to install progress

  InstallBackend gains opID + onProgress parameters. When both are set, the adapter subscribes to nodes.<nodeID>.backend.install.<opID>.progress BEFORE publishing the install request, decodes each message into the caller's onProgress callback in a goroutine (so a slow callback never stalls the NATS reader thread), and unsubscribes after RequestJSON returns. When onProgress is nil OR opID is empty (the reconciler retry path), subscription is skipped entirely - silent installs cost nothing extra. Subscribe failure is logged at Warn and the install proceeds without progress streaming; the NATS round-trip still owns terminal status.

* feat(distributed): forward backend install progress into galleryop OpStatus

  DistributedBackendManager.InstallBackend now passes the gallery op ID and a progress bridge into the adapter call. Each BackendInstallProgressEvent from the worker becomes a galleryop.ProgressCallback tick - which the existing backendHandler already turns into OpStatus.UpdateStatus, so the admin UI/SSE polling sees per-byte progress for distributed installs without any UI-side change. UpgradeBackend is intentionally left silent for now: its wire request (BackendUpgradeRequest) does not carry OpID, and rolling-update fallback is the rarer path. Will be picked up in a follow-up if the worker upgrade path also gets a progress channel.

* test(distributed): InstallBackend tolerates silent (pre-Phase-2) workers

  A worker on pre-Phase-2 code never publishes progress events. The new master subscribes optimistically; this spec pins that a silent worker still produces a green install with no progressCb ticks. The install reply is the source of truth for terminal state; the progress stream is a best-effort UX enrichment.

* docs(distributed): document install progress streaming

  Note the new nodes.<nodeID>.backend.install.<opID>.progress subject and the silent-worker compatibility behavior so operators know to expect real-time progress and what happens on a mixed-version cluster.

* docs(distributed): note progress-event ordering trade-off in InstallBackend

  Document near the goroutine dispatch why ordering at the consumer is best-effort, why it rarely matters in practice (worker debounce >> goroutine jitter), and what a future hardening pass would look like (Seq field + stale-by-seq drop). Stops the next reader from accidentally "fixing" the goroutine pool away.

* feat(galleryop): add NodeProgress + OpStatus.Nodes for per-node breakdown

  Adds the data model the UI needs to render an expandable per-node breakdown of a fanned-out backend install. NodeProgress carries node identity (ID + name), per-node status (queued / running_on_worker / success / error / downloading), the current file + bytes + percentage from the Phase 2 progress stream, and any per-node error. OpStatus.Nodes is the slice the /api/operations handler will surface in a follow-up.

* feat(galleryop): UpdateNodeProgress merges per-node ticks by NodeID

  GalleryService.UpdateNodeProgress(opID, nodeID, np) merges a NodeProgress into OpStatus.Nodes (keyed by NodeID, no duplicates) and mirrors the latest tick into the aggregate Progress / FileName / DownloadedFileSize / TotalFileSize fields so the legacy single-bar OperationsBar view keeps working unchanged alongside the new per-node breakdown. Concurrent-safe via the existing g.Mutex.

* feat(distributed): write per-node OpStatus entries during install fan-out

  DistributedBackendManager now accepts a nodeProgressSink and feeds it two streams:

  1. enqueueAndDrainBackendOp emits a per-node terminal entry on each status it appends to BackendOpResult (queued, success, error, running_on_worker). The opID is threaded through the function so the sink gets the right gallery op identity.
  2. The install apply closure fans each BackendInstallProgressEvent into the sink as a downloading entry, alongside the legacy progressCb path so the aggregate single-bar view stays correct.

  Production wiring passes the GalleryService (which implements UpdateNodeProgress via Task 2) as the sink. Single-node tests pass nil. DeleteBackend and UpgradeBackend pass an empty opID so the sink path no-ops for ops that aren't gallery-tracked the same way as Install.

* feat(operations): expose per-node breakdown on /api/operations

  When an operation's OpStatus has Nodes entries (populated by the Phase 4 progress sink wiring), surface them as a "nodes" array on the /api/operations response, sorted by node_name for stable rendering. Backward compatible: legacy clients ignore the field; ops without any node entries (single-node mode, model installs) omit the array entirely thanks to the empty-slice guard.

* feat(ui): per-node breakdown in OperationsBar

  When an install op fans out to more than one worker, the operations bar now shows a "N nodes" chevron that expands into a per-node list. Each row carries the node's status (color-coded pill), the current file being downloaded, byte counts, percentage, and a thin per-node progress bar. Yellow "Worker busy" pill marks running_on_worker status with a tooltip explaining the NATS round-trip timed out but the worker is still installing in the background. Backward compatible: ops without a nodes field (legacy or single-node mode) render as before. State for expand/collapse is local to the component, keyed by jobID/id - reload starts collapsed.

* docs(distributed): document per-node breakdown in the operations bar

  Adds a short subsection covering the expandable "N nodes" chevron in the OperationsBar admin UI, the meaning of each status pill, and how it relates to the /api/operations nodes array.

* fix(galleryop): UpdateStatus preserves Nodes when caller sends none

  Real-world bug surfaced by the Phase 4 multi-worker smoke test: the nodes[] array in /api/operations flickered between a single node at a time on a 2-worker install. Root cause: the Phase 2 progress bridge also calls the legacy progressCb -> UpdateStatus(&OpStatus{...}) on every tick. UpdateStatus then overwrote the entire status pointer, wiping the Nodes slice that UpdateNodeProgress had just merged in. Fix: in UpdateStatus, if the incoming op has an empty Nodes slice, carry forward the previous status's Nodes before storing. Callers that explicitly populate Nodes still win (their slice replaces the prior one, no merge across the two code paths). Two regression specs added pinning both directions of the contract.

* docs(distributed): strip implementation details from user-facing docs

  Trim the new install/upgrade timeout rows and the install-progress sections to focus on what the operator sees and tunes. Drops:

  - the NATS subject names and pub/sub mechanics
  - "round-trip" / reconciler / backend.list jargon
  - /api/operations polling cadence
  - "pre-2026-05-22" version references

  Reframes the breakdown text around the admin UI (Operations Bar, chevron, status pills, "Worker busy" tooltip). Implementation context lives in the agent notes and code comments.

* refactor(config): move DistributedConfig.Validate flag names to constants

  The negative-duration check map was a wall of literal kebab-case strings that had to stay in sync with the kong-derived CLI flag names manually. Move them to a Flag* const block alongside the existing Default* block so a rename of either the Go field or the CLI naming convention forces a compile error rather than silent drift. Sole consumer today is Validate; the constants are exported so future operator-facing surfaces (e.g. error messages on other validation paths) can reference them by name instead of repeating the literals. Tests pin both the literal values (so a future "let's just rename this" doesn't accidentally regress the CLI flag) and the negative-duration error message for the new BackendInstall / BackendUpgrade fields.

* refactor(distributed): extract NodeStatus and Phase enums to constants

  Sweep for the same literal-string-as-identifier pattern called out on the Validate flag names: the per-node install status enum ("queued" | "downloading" | "running_on_worker" | "success" | "error") appeared as raw literals across managers_distributed.go (10+ sites, including 3 separate `n.Status == "running_on_worker"` checks), operation.go, and the test suite. Same shape for the Phase enum ("resolving" | "downloading" | "extracting" | "starting") in the worker-side progress publisher. Promote both to exported const blocks:

  - galleryop.NodeStatus{Queued,Downloading,RunningOnWorker,Success,Error} shared between galleryop.NodeProgress.Status (the wire field) and nodes.NodeOpStatus.Status (the in-process per-node summary)
  - messaging.Phase{Resolving,Downloading,Extracting,Starting} shared between the worker publisher and any future consumer that needs to switch on phase

  Tests pin both the literal values (so a future "let's just rename" doesn't silently change the JSON wire) and use the constants in setup (so the producer side stays drift-protected). Wire-format assertions on the /api/operations JSON output keep their literals deliberately, so the constant value can never silently diverge from what the UI receives. Out of scope for this PR (separate cleanup): the finetune and quantization job-status enums have the same anti-pattern with 14+ literal sites each, but predate this PR's work.

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
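The sentinel-error flow described in the commit message (a request timeout is wrapped in ErrWorkerStillInstalling, and the manager branches on it to report running_on_worker instead of error) can be sketched in isolation. This is a minimal illustration, not the project's code: `errRequestTimeout`, `wrapInstallErr`, and `nodeStatus` are hypothetical names, the local `ErrWorkerStillInstalling` is a stand-in for the galleryop sentinel, and the status strings are the literal values quoted in the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrWorkerStillInstalling is a local stand-in for the galleryop sentinel
// named in the commit message; the real declaration may differ.
var ErrWorkerStillInstalling = errors.New("worker still installing")

// errRequestTimeout is a hypothetical placeholder for whatever timeout error
// the messaging client returns when the request-reply deadline expires.
var errRequestTimeout = errors.New("request timeout")

// wrapInstallErr (hypothetical helper) wraps a timeout in the sentinel so
// callers can tell "still working" apart from a hard failure. Other errors
// pass through unchanged.
func wrapInstallErr(err error) error {
	if errors.Is(err, errRequestTimeout) {
		return fmt.Errorf("%w: %v", ErrWorkerStillInstalling, err)
	}
	return err
}

// nodeStatus (hypothetical helper) maps the result of one per-node attempt to
// the per-node status strings listed in the commit message.
func nodeStatus(err error) string {
	switch {
	case err == nil:
		return "success"
	case errors.Is(err, ErrWorkerStillInstalling):
		return "running_on_worker"
	default:
		return "error"
	}
}

func main() {
	fmt.Println(nodeStatus(wrapInstallErr(errRequestTimeout)))
	fmt.Println(nodeStatus(wrapInstallErr(errors.New("no platform variant"))))
	fmt.Println(nodeStatus(wrapInstallErr(nil)))
}
```

Wrapping with `%w` keeps the original timeout text in the message while still letting `errors.Is` match the sentinel through any further wrapping added higher up the call chain.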
631 lines
26 KiB
Go
package nodes

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"strings"

	"github.com/mudler/LocalAI/core/config"
	"github.com/mudler/LocalAI/core/gallery"
	"github.com/mudler/LocalAI/core/services/galleryop"
	"github.com/mudler/LocalAI/core/services/messaging"
	"github.com/mudler/LocalAI/pkg/model"
	"github.com/mudler/LocalAI/pkg/system"
	"github.com/mudler/xlog"
	"github.com/nats-io/nats.go"
)

// DistributedModelManager wraps a local ModelManager and adds NATS fan-out
// for model deletion so worker nodes clean up stale files.
type DistributedModelManager struct {
	local   galleryop.ModelManager
	adapter *RemoteUnloaderAdapter
}

// NewDistributedModelManager creates a DistributedModelManager.
// Backend auto-install is disabled because the frontend node delegates
// inference to workers and never runs backends locally.
func NewDistributedModelManager(appConfig *config.ApplicationConfig, ml *model.ModelLoader, adapter *RemoteUnloaderAdapter) *DistributedModelManager {
	local := galleryop.NewLocalModelManager(appConfig, ml)
	local.SetAutoInstallBackend(false)
	return &DistributedModelManager{
		local:   local,
		adapter: adapter,
	}
}

func (d *DistributedModelManager) DeleteModel(name string) error {
	err := d.local.DeleteModel(name)
	// Best-effort: fan out model.delete to worker nodes
	if rcErr := d.adapter.DeleteModelFiles(name); rcErr != nil {
		xlog.Warn("Failed to propagate model file deletion to workers", "model", name, "error", rcErr)
	}
	return err
}

func (d *DistributedModelManager) InstallModel(ctx context.Context, op *galleryop.ManagementOp[gallery.GalleryModel, gallery.ModelConfig], progressCb galleryop.ProgressCallback) error {
	return d.local.InstallModel(ctx, op, progressCb)
}

// nodeProgressSink is the narrow interface DistributedBackendManager uses to
// publish per-node progress without dragging in the full *GalleryService.
// nil means "no sink, skip per-node writes" (used by single-node tests).
type nodeProgressSink interface {
	UpdateNodeProgress(opID, nodeID string, np galleryop.NodeProgress)
}

// DistributedBackendManager wraps a local BackendManager and adds NATS fan-out
// for backend deletion so worker nodes clean up stale files.
type DistributedBackendManager struct {
	local            galleryop.BackendManager
	adapter          *RemoteUnloaderAdapter
	registry         *NodeRegistry
	backendGalleries []config.Gallery
	systemState      *system.SystemState
	progressSink     nodeProgressSink
}

// NewDistributedBackendManager creates a DistributedBackendManager.
// progressSink may be nil to disable per-node OpStatus writes (single-node
// tests don't need it).
func NewDistributedBackendManager(appConfig *config.ApplicationConfig, ml *model.ModelLoader, adapter *RemoteUnloaderAdapter, registry *NodeRegistry, progressSink nodeProgressSink) *DistributedBackendManager {
	return &DistributedBackendManager{
		local:            galleryop.NewLocalBackendManager(appConfig, ml),
		adapter:          adapter,
		registry:         registry,
		backendGalleries: appConfig.BackendGalleries,
		systemState:      appConfig.SystemState,
		progressSink:     progressSink,
	}
}

// NodeOpStatus is the per-node outcome of a backend lifecycle operation.
// Returned as part of BackendOpResult so the frontend can surface exactly
// what happened on each worker instead of a single joined error string.
// Status holds one of the galleryop.NodeStatus* constants.
type NodeOpStatus struct {
	NodeID   string `json:"node_id"`
	NodeName string `json:"node_name"`
	Status   string `json:"status"`
	Error    string `json:"error,omitempty"`
}

// BackendOpResult aggregates per-node outcomes.
type BackendOpResult struct {
	Nodes []NodeOpStatus `json:"nodes"`
}

// Err returns a non-nil error aggregating per-node hard failures
// (Status == "error"). Queued nodes (waiting for reconciler retry) are not
// failures - surfacing them as errors would mislead users about durable
// intent. Used by Install/Upgrade/Delete so reply.Success=false from
// workers reaches OpStatus.Error and the UI, instead of being silently
// dropped on the way up.
func (r BackendOpResult) Err() error {
	var failures []string
	for _, n := range r.Nodes {
		if n.Status == galleryop.NodeStatusError {
			failures = append(failures, fmt.Sprintf("%s: %s", n.NodeName, n.Error))
		}
	}
	if len(failures) == 0 {
		return nil
	}
	return errors.New(strings.Join(failures, "; "))
}

// enqueueAndDrainBackendOp is the shared scaffolding for
// delete/install/upgrade. Every non-pending node gets a pending_backend_ops
// row (intent is durable even if the node is offline). Currently-healthy
// nodes get an immediate attempt; success deletes the row, failure records
// the error and leaves the row for the reconciler to retry.
//
// `apply` is the NATS round-trip for one node. Returning an error keeps the
// row in the queue and marks the per-node status as "error"; returning nil
// deletes the row and reports "success". For non-healthy nodes the status
// is "queued" - no attempt is made right now, reconciler will pick it up
// when the node returns.
// targetNodeIDs is an optional allowlist: when non-nil, only nodes whose ID is
// in the set are visited. Used by UpgradeBackend to avoid asking nodes that
// never had the backend installed to "upgrade" it - such requests fail at the
// gallery (no platform variant) and would otherwise leave a forever-retrying
// pending_backend_ops row. nil means "fan out to every node" (Install/Delete).
//
// opID is the gallery operation identifier; when non-empty and progressSink is
// set, every per-node terminal status appended to BackendOpResult is also
// mirrored into the sink so the UI's per-node OpStatus.Nodes view stays in
// lockstep with the manager's view. opID may be empty for ops that aren't
// gallery-tracked (e.g. DeleteBackend's plain code path).
func (d *DistributedBackendManager) enqueueAndDrainBackendOp(ctx context.Context, opID, op, backend string, galleriesJSON []byte, targetNodeIDs map[string]bool, apply func(node BackendNode) error) (BackendOpResult, error) {
	allNodes, err := d.registry.List(ctx)
	if err != nil {
		return BackendOpResult{}, err
	}

	// emitNodeProgress is a small helper that funnels every NodeOpStatus we
	// append to result.Nodes into the per-node OpStatus sink (when configured
	// and opID is known). Keeping it inline avoids drift between the
	// BackendOpResult view and the sink view - they're written from the same
	// code path on the same terminal statuses.
	emitNodeProgress := func(node BackendNode, status, errMsg string) {
		if d.progressSink == nil || opID == "" {
			return
		}
		d.progressSink.UpdateNodeProgress(opID, node.ID, galleryop.NodeProgress{
			NodeID:   node.ID,
			NodeName: node.Name,
			Status:   status,
			Error:    errMsg,
		})
	}

	result := BackendOpResult{Nodes: make([]NodeOpStatus, 0, len(allNodes))}
	for _, node := range allNodes {
		// Pending nodes haven't been approved yet - no intent to apply.
		if node.Status == StatusPending {
			continue
		}
		// Backend lifecycle ops only make sense on backend-type workers.
		// Agent workers don't subscribe to backend.install/delete/list, so
		// enqueueing for them guarantees a forever-retrying row that the
		// reconciler can never drain. Silently skip - they aren't consumers.
		if node.NodeType != "" && node.NodeType != NodeTypeBackend {
			continue
		}
		if targetNodeIDs != nil && !targetNodeIDs[node.ID] {
			continue
		}
		if err := d.registry.UpsertPendingBackendOp(ctx, node.ID, backend, op, galleriesJSON); err != nil {
			xlog.Warn("Failed to enqueue backend op", "op", op, "node", node.Name, "backend", backend, "error", err)
			errMsg := fmt.Sprintf("enqueue failed: %v", err)
			result.Nodes = append(result.Nodes, NodeOpStatus{
				NodeID: node.ID, NodeName: node.Name, Status: galleryop.NodeStatusError,
				Error: errMsg,
			})
			emitNodeProgress(node, galleryop.NodeStatusError, errMsg)
			continue
		}

		if node.Status != StatusHealthy {
			// Intent is recorded; reconciler will retry when the node recovers.
			errMsg := fmt.Sprintf("node %s, will retry when healthy", node.Status)
			result.Nodes = append(result.Nodes, NodeOpStatus{
				NodeID: node.ID, NodeName: node.Name, Status: galleryop.NodeStatusQueued,
				Error: errMsg,
			})
			emitNodeProgress(node, galleryop.NodeStatusQueued, errMsg)
			continue
		}

		applyErr := apply(node)
		if applyErr == nil {
			// Find the row we just upserted and delete it; cheap but requires
			// a lookup since UpsertPendingBackendOp doesn't return the ID.
			if err := d.deletePendingRow(ctx, node.ID, backend, op); err != nil {
				xlog.Debug("Failed to clear pending backend op after success", "error", err)
			}
			result.Nodes = append(result.Nodes, NodeOpStatus{
				NodeID: node.ID, NodeName: node.Name, Status: galleryop.NodeStatusSuccess,
			})
			emitNodeProgress(node, galleryop.NodeStatusSuccess, "")
			continue
		}

		// Record failure for backoff. If it's an ErrNoResponders, the node's
		// gone AWOL - mark unhealthy so the router stops picking it too.
		errMsg := applyErr.Error()

		// Worker-still-installing is a "soft" failure: the worker is most
		// likely still pulling the OCI image. Keep the row, push NextRetryAt
		// out so the reconciler does not immediately re-fire another install
		// while the worker is still busy, and report the in-progress state
		// to the caller. The next reconciler pass / backend.list confirms
		// the actual outcome.
		if errors.Is(applyErr, galleryop.ErrWorkerStillInstalling) {
			if id, err := d.findPendingRow(ctx, node.ID, backend, op); err == nil {
				_ = d.registry.RecordPendingBackendOpInFlight(ctx, id, errMsg, d.adapter.InstallTimeout())
			}
			result.Nodes = append(result.Nodes, NodeOpStatus{
				NodeID: node.ID, NodeName: node.Name, Status: galleryop.NodeStatusRunningOnWorker, Error: errMsg,
			})
			emitNodeProgress(node, galleryop.NodeStatusRunningOnWorker, errMsg)
			continue
		}

		if errors.Is(applyErr, nats.ErrNoResponders) {
			xlog.Warn("No NATS responders for node, marking unhealthy", "node", node.Name, "nodeID", node.ID)
			d.registry.MarkUnhealthy(ctx, node.ID)
		}
		if id, err := d.findPendingRow(ctx, node.ID, backend, op); err == nil {
			_ = d.registry.RecordPendingBackendOpFailure(ctx, id, errMsg)
		}
		result.Nodes = append(result.Nodes, NodeOpStatus{
			NodeID: node.ID, NodeName: node.Name, Status: galleryop.NodeStatusError, Error: errMsg,
		})
		emitNodeProgress(node, galleryop.NodeStatusError, errMsg)
	}
	return result, nil
}

// findPendingRow looks up the ID of a pending_backend_ops row by its
// composite key. Used to hand off to RecordPendingBackendOpFailure /
// DeletePendingBackendOp after UpsertPendingBackendOp upserts by the same
// composite key.
func (d *DistributedBackendManager) findPendingRow(ctx context.Context, nodeID, backend, op string) (uint, error) {
	var row PendingBackendOp
	if err := d.registry.db.WithContext(ctx).
		Where("node_id = ? AND backend = ? AND op = ?", nodeID, backend, op).
		First(&row).Error; err != nil {
		return 0, err
	}
	return row.ID, nil
}

// deletePendingRow removes the queue row keyed by (nodeID, backend, op).
func (d *DistributedBackendManager) deletePendingRow(ctx context.Context, nodeID, backend, op string) error {
	return d.registry.db.WithContext(ctx).
		Where("node_id = ? AND backend = ? AND op = ?", nodeID, backend, op).
		Delete(&PendingBackendOp{}).Error
}

// DeleteBackend fans out backend deletion to every known node. The previous
// implementation silently skipped non-healthy nodes, which meant zombies
// reappeared once those nodes returned. Now the intent is durable (see
// enqueueAndDrainBackendOp) and the reconciler catches up later.
func (d *DistributedBackendManager) DeleteBackend(name string) error {
	// Local delete first (frontend rarely has backends installed in
	// distributed mode, but the gallery operation still expects it; ignore
	// "not found" which is the common case).
	if err := d.local.DeleteBackend(name); err != nil {
		if !errors.Is(err, gallery.ErrBackendNotFound) {
			return err
		}
		xlog.Debug("Backend not found locally, will attempt deletion on workers", "backend", name)
	}

	ctx := context.Background()
	// Empty opID: plain DeleteBackend isn't gallery-tracked the same way as
	// Install/Upgrade (no progress dialog), so we skip the per-node sink
	// writes here. DeleteBackendDetailed is the HTTP path that surfaces
	// per-node results in its own response.
	result, err := d.enqueueAndDrainBackendOp(ctx, "", OpBackendDelete, name, nil, nil, func(node BackendNode) error {
		reply, err := d.adapter.DeleteBackend(node.ID, name)
		if err != nil {
			return err
		}
		if !reply.Success {
			return fmt.Errorf("delete failed: %s", reply.Error)
		}
		return nil
	})
	if err != nil {
		return err
	}
	return result.Err()
}

// DeleteBackendDetailed is the per-node-result variant called by the HTTP
// handler so the UI can render a per-node status drawer. DeleteBackend still
// returns error-only for callers that don't care about node breakdown.
func (d *DistributedBackendManager) DeleteBackendDetailed(ctx context.Context, name string) (BackendOpResult, error) {
	if err := d.local.DeleteBackend(name); err != nil && !errors.Is(err, gallery.ErrBackendNotFound) {
		return BackendOpResult{}, err
	}
	return d.enqueueAndDrainBackendOp(ctx, "", OpBackendDelete, name, nil, nil, func(node BackendNode) error {
		reply, err := d.adapter.DeleteBackend(node.ID, name)
		if err != nil {
			return err
		}
		if !reply.Success {
			return fmt.Errorf("delete failed: %s", reply.Error)
		}
		return nil
	})
}

// ListBackends aggregates installed backends from all worker nodes, preserving
// per-node attribution. Each SystemBackend.Nodes entry records which node has
// the backend and the version/digest it reports. The top-level Metadata is
// populated from the first node seen so single-node-minded callers still work.
//
// Pending/offline/draining nodes are skipped because they aren't expected to
// answer NATS requests; unhealthy nodes are still queried: ErrNoResponders
// then marks them unhealthy and the loop continues.
func (d *DistributedBackendManager) ListBackends() (gallery.SystemBackends, error) {
	result := make(gallery.SystemBackends)
	allNodes, err := d.registry.List(context.Background())
	if err != nil {
		return result, err
	}

	for _, node := range allNodes {
		if node.Status == StatusPending || node.Status == StatusOffline || node.Status == StatusDraining {
			continue
		}
		reply, err := d.adapter.ListBackends(node.ID)
		if err != nil {
			if errors.Is(err, nats.ErrNoResponders) {
				xlog.Warn("No NATS responders for node, marking unhealthy", "node", node.Name, "nodeID", node.ID)
				d.registry.MarkUnhealthy(context.Background(), node.ID)
				continue
			}
			xlog.Warn("Failed to list backends on worker", "node", node.Name, "error", err)
			continue
		}
		if reply.Error != "" {
			xlog.Warn("Worker returned error listing backends", "node", node.Name, "error", reply.Error)
			continue
		}
		for _, b := range reply.Backends {
			ref := gallery.NodeBackendRef{
				NodeID:      node.ID,
				NodeName:    node.Name,
				NodeStatus:  node.Status,
				Version:     b.Version,
				Digest:      b.Digest,
				URI:         b.URI,
				InstalledAt: b.InstalledAt,
			}
			entry, exists := result[b.Name]
			if !exists {
				entry = gallery.SystemBackend{
					Name:     b.Name,
					IsSystem: b.IsSystem,
					IsMeta:   b.IsMeta,
					Metadata: &gallery.BackendMetadata{
						Name:        b.Name,
						InstalledAt: b.InstalledAt,
						GalleryURL:  b.GalleryURL,
						Version:     b.Version,
						URI:         b.URI,
						Digest:      b.Digest,
					},
				}
			}
			entry.Nodes = append(entry.Nodes, ref)
			result[b.Name] = entry
		}
	}

	// Proactively clear pending_backend_ops install rows whose intent is now
	// satisfied: the backend is reported installed on its target node. Without
	// this, the row sits in the queue until next_retry_at expires (up to the
	// install timeout, default 15m) and the operator UI shows the install as
	// "still installing in background" for that whole window even though the
	// worker has actually been ready for minutes. We only clear install rows;
	// upgrade and delete rows have presence-based semantics that do NOT match
	// backend.list confirmation.
	d.clearSatisfiedInstallRows(context.Background(), result)
	return result, nil
}

// clearSatisfiedInstallRows removes pending_backend_ops install rows whose
// (nodeID, backend) pair now appears in the cluster-wide backend listing.
// Called by ListBackends after fan-out so the proactive clear sees every
// node's report. Best-effort: a DB failure is logged and the row stays for
// the reconciler to drain via its slower path.
func (d *DistributedBackendManager) clearSatisfiedInstallRows(ctx context.Context, backends gallery.SystemBackends) {
	rows, err := d.registry.ListPendingBackendOps(ctx)
	if err != nil {
		xlog.Debug("clearSatisfiedInstallRows: failed to list pending ops", "error", err)
		return
	}
	if len(rows) == 0 {
		return
	}
	// Build a (nodeID, backend) presence set from the listing.
	present := make(map[string]map[string]bool, len(backends))
	for name, b := range backends {
		for _, ref := range b.Nodes {
			if present[ref.NodeID] == nil {
				present[ref.NodeID] = make(map[string]bool)
			}
			present[ref.NodeID][name] = true
		}
	}
	for _, row := range rows {
		if row.Op != OpBackendInstall {
			continue
		}
		if !present[row.NodeID][row.Backend] {
			continue
		}
		if err := d.registry.DeletePendingBackendOp(ctx, row.ID); err != nil {
			xlog.Debug("clearSatisfiedInstallRows: delete failed",
				"id", row.ID, "node", row.NodeID, "backend", row.Backend, "error", err)
			continue
		}
		xlog.Info("Reconciler: pending install row satisfied by backend.list",
			"node", row.NodeID, "backend", row.Backend)
	}
}

// InstallBackend fans out installation through the pending-ops queue so
// non-healthy nodes get retried when they come back instead of being silently
// skipped. Reply success from the NATS round-trip deletes the queue row;
// reply.Success==false is treated as an error so the row stays for retry.
//
// When op.TargetNodeID is set, only that node is visited - the same allowlist
// path UpgradeBackend uses. Empty TargetNodeID preserves the original fan-out
// behavior so the periodic reconciler and /api/backends/install/:id keep
// working unchanged.
func (d *DistributedBackendManager) InstallBackend(ctx context.Context, op *galleryop.ManagementOp[gallery.GalleryBackend, any], progressCb galleryop.ProgressCallback) error {
	galleriesJSON, _ := json.Marshal(op.Galleries)
	backendName := op.GalleryElementName

	var targetNodeIDs map[string]bool
	if op.TargetNodeID != "" {
		targetNodeIDs = map[string]bool{op.TargetNodeID: true}
	}

	result, err := d.enqueueAndDrainBackendOp(ctx, op.ID, OpBackendInstall, backendName, galleriesJSON, targetNodeIDs, func(node BackendNode) error {
		// onProgress fans each BackendInstallProgressEvent into two
		// observers: the legacy single-bar progressCb (kept so callers
		// that only consume the aggregate view keep working) and the
		// per-node sink (so OpStatus.Nodes gets a "downloading" tick
		// per file/percentage with node attribution). Defined inside the
		// loop so each node captures its own node.Name into the closure.
		onProgress := func(ev messaging.BackendInstallProgressEvent) {
			if progressCb != nil {
				progressCb(ev.FileName, ev.Current, ev.Total, ev.Percentage)
			}
			if d.progressSink != nil && op.ID != "" {
				d.progressSink.UpdateNodeProgress(op.ID, ev.NodeID, galleryop.NodeProgress{
					NodeID:     ev.NodeID,
					NodeName:   node.Name,
					Status:     galleryop.NodeStatusDownloading,
					FileName:   ev.FileName,
					Current:    ev.Current,
					Total:      ev.Total,
					Percentage: ev.Percentage,
					Phase:      ev.Phase,
				})
			}
		}
		// nil-callback shortcut: when there is nothing to deliver to,
		// hand the adapter a nil onProgress so it skips the per-op NATS
		// subscription. Matches the pre-Phase-4 bridgeProgressCb semantics.
		var onProgressArg func(messaging.BackendInstallProgressEvent)
		if progressCb != nil || d.progressSink != nil {
			onProgressArg = onProgress
		}
		// Admin-driven backend install: not tied to a specific replica slot.
		// Pass replica 0 - the worker's processKey is "backend#0" when no
		// modelID is supplied, matching pre-PR4 behavior.
		reply, err := d.adapter.InstallBackend(node.ID, backendName, "", string(galleriesJSON), op.ExternalURI, op.ExternalName, op.ExternalAlias, 0, op.ID, onProgressArg)
		if err != nil {
			return err
		}
		if !reply.Success {
			return fmt.Errorf("install failed: %s", reply.Error)
		}
		return nil
	})
	if err != nil {
		return err
	}
	if hardErr := result.Err(); hardErr != nil {
		return hardErr
	}
	// No hard failures, but if at least one node reported running_on_worker,
	// surface a wrapped ErrWorkerStillInstalling so galleryop can render a
	// yellow in-progress state instead of green success. The reconciler
	// will confirm the actual outcome on its next pass via backend.list.
	for _, n := range result.Nodes {
		if n.Status == galleryop.NodeStatusRunningOnWorker {
			return fmt.Errorf("%w: %s", galleryop.ErrWorkerStillInstalling, summarizeRunningOnWorker(result.Nodes))
		}
	}
	return nil
}

// UpgradeBackend uses a separate NATS subject (backend.upgrade) so the slow
// force-reinstall path doesn't head-of-line-block routine model loads on
// the same worker. Only nodes that already report this backend as installed
// are targeted: fanning out to every node would ask workers to "upgrade"
// something they never had, which fails at the gallery (e.g. a darwin/arm64
// worker has no platform variant for a linux-only backend) and leaves a
// forever-retrying pending_backend_ops row.
//
// Rolling-update fallback: when a worker returns nats.ErrNoResponders on
// backend.upgrade, we try the legacy backend.install Force=true path so a
// new master + old worker still converges. Drop the fallback once every
// worker in the fleet is on 2026-05-08 or newer.
func (d *DistributedBackendManager) UpgradeBackend(ctx context.Context, name string, progressCb galleryop.ProgressCallback) error {
	galleriesJSON, _ := json.Marshal(d.backendGalleries)

	installed, err := d.ListBackends()
	if err != nil {
		return fmt.Errorf("failed to list cluster backends: %w", err)
	}
	entry, ok := installed[name]
	if !ok || len(entry.Nodes) == 0 {
		return fmt.Errorf("backend %q is not installed on any node", name)
	}
	targetNodeIDs := make(map[string]bool, len(entry.Nodes))
	for _, n := range entry.Nodes {
		targetNodeIDs[n.NodeID] = true
	}

	// Empty opID: the caller (galleryop) doesn't thread an op ID into
	// UpgradeBackend today, so we can't tag per-node sink writes with the
	// right OpStatus key. Until the upgrade path takes a ManagementOp the
	// way InstallBackend does, the sink stays no-op here.
	result, err := d.enqueueAndDrainBackendOp(ctx, "", OpBackendUpgrade, name, galleriesJSON, targetNodeIDs, func(node BackendNode) error {
		reply, err := d.adapter.UpgradeBackend(node.ID, name, string(galleriesJSON), "", "", "", 0)
		if err != nil {
			// Rolling-update fallback: an older worker doesn't know
			// backend.upgrade. Try the legacy install-with-force path.
			if errors.Is(err, nats.ErrNoResponders) {
				instReply, instErr := d.adapter.installWithForceFallback(node.ID, name, string(galleriesJSON), "", "", "", 0)
				if instErr != nil {
					return instErr
				}
				if !instReply.Success {
					return fmt.Errorf("upgrade (legacy fallback) failed: %s", instReply.Error)
				}
				return nil
			}
			return err
		}
		if !reply.Success {
			return fmt.Errorf("upgrade failed: %s", reply.Error)
		}
		return nil
	})
	if err != nil {
		return err
	}
	if hardErr := result.Err(); hardErr != nil {
		return hardErr
	}
	// Same in-progress surfacing as InstallBackend: a long-running worker
	// upgrade that timed out the NATS round-trip must not be reported as
	// green success.
	for _, n := range result.Nodes {
		if n.Status == galleryop.NodeStatusRunningOnWorker {
			return fmt.Errorf("%w: %s", galleryop.ErrWorkerStillInstalling, summarizeRunningOnWorker(result.Nodes))
		}
	}
	return nil
}

// IsDistributed reports that installs from this manager fan out across the
// cluster. The HTTP layer reads this to gate hardware-specific installs on
// /api/backends/apply (which would otherwise silently land on every node).
func (d *DistributedBackendManager) IsDistributed() bool { return true }

// CheckUpgrades checks for available backend upgrades across the cluster.
//
// The previous implementation delegated to d.local, which called
// ListSystemBackends on the frontend, but in distributed mode the frontend
// has no backends installed locally, so the upgrade loop never ran and the UI
// never surfaced any upgrades. We now feed the cluster-wide aggregation
// (including per-node versions/digests) into gallery.CheckUpgradesAgainst so
// digest-based detection actually works and cluster drift is visible.
func (d *DistributedBackendManager) CheckUpgrades(ctx context.Context) (map[string]gallery.UpgradeInfo, error) {
	installed, err := d.ListBackends()
	if err != nil {
		return nil, err
	}
	// systemState is used by AvailableBackends (gallery paths + meta-backend
	// resolution). The `installed` argument is what the old code got wrong:
	// it used to come from the empty frontend filesystem.
	return gallery.CheckUpgradesAgainst(ctx, d.backendGalleries, d.systemState, installed)
}

// summarizeRunningOnWorker builds a short human-readable summary of which
// nodes are still installing in the background, for inclusion in the
// wrapped ErrWorkerStillInstalling error.
func summarizeRunningOnWorker(nodes []NodeOpStatus) string {
	var names []string
	for _, n := range nodes {
		if n.Status == galleryop.NodeStatusRunningOnWorker {
			names = append(names, n.NodeName)
		}
	}
	return strings.Join(names, ", ")
}