fix(distributed): orchestrator resilience — auto-upgrade routing, worker bind-wait, RAG-init crash, log spam (#9657)

* fix(nodes/health): skip stale-marking already-offline nodes

The health monitor re-emitted "Node heartbeat stale" + "Marking stale
node offline" + MarkOffline on every cycle for nodes that were already
in the offline (or unhealthy) state. For an operator-stopped node this
flooded the logs with the same WARN+INFO pair every check interval.

Skip the staleness branch when the node is already StatusOffline /
StatusUnhealthy — the state is already what we'd write, so neither the
log lines nor the DB update carry information.
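
A sketch of the fixed pass (illustrative types and names, not the exact
LocalAI identifiers; markOffline stands in for the DB update):

    package health

    import (
    	"log/slog"
    	"time"
    )

    type Status int

    const (
    	StatusOnline Status = iota
    	StatusUnhealthy
    	StatusOffline
    )

    type Node struct {
    	ID            string
    	Status        Status
    	LastHeartbeat time.Time
    }

    // markStaleNodes skips nodes already recorded as offline/unhealthy,
    // so the WARN+INFO pair and the status write fire once per
    // transition instead of once per check cycle.
    func markStaleNodes(nodes []*Node, threshold time.Duration, markOffline func(id string) error) {
    	for _, n := range nodes {
    		if time.Since(n.LastHeartbeat) <= threshold {
    			continue // heartbeat fresh
    		}
    		if n.Status == StatusOffline || n.Status == StatusUnhealthy {
    			continue // state already recorded; nothing new to log or write
    		}
    		slog.Warn("Node heartbeat stale", "node", n.ID)
    		slog.Info("Marking stale node offline", "node", n.ID)
    		if err := markOffline(n.ID); err != nil {
    			slog.Error("MarkOffline failed", "node", n.ID, "error", err)
    			continue
    		}
    		n.Status = StatusOffline
    	}
    }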

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(worker): wait for backend gRPC bind before replying to backend.install

The backend supervisor used to wait up to 4s (20 × 200ms) for the
backend's gRPC server to answer a HealthCheck, then log a warning and
reply Success with the bind address anyway. On slower nodes (a Jetson
Orin doing first-boot CUDA init, large CGO library load) the gRPC
listener wasn't up yet, so the frontend's first LoadModel dial returned
"connect: connection refused" and the operator chased a phantom network
issue instead of a startup-timing one.

Two changes:

  - Bump the readiness window to 30s. CUDA init on Orin/Thor first boot
    measures in seconds, not milliseconds.
  - On deadline-exceeded, stop the half-started process, recycle the
    port, and return an error with the backend's stderr tail. The
    frontend now gets a real failure with diagnostic context instead of
    a misleading ECONNREFUSED on a downstream dial.

Process death during the wait window keeps its existing fast-fail path.
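
Consolidated, the post-fix wait has this shape (simplified sketch;
healthCheck, stopAndRecycle, and stderrTail stand in for supervisor
internals, and the real change is in the diff below):

    package worker

    import (
    	"fmt"
    	"time"
    )

    func waitReady(addr string, healthCheck func() bool, stopAndRecycle func(), stderrTail func() string) (string, error) {
    	const (
    		pollInterval = 200 * time.Millisecond
    		timeout      = 30 * time.Second
    	)
    	deadline := time.Now().Add(timeout)
    	for time.Now().Before(deadline) {
    		time.Sleep(pollInterval)
    		if healthCheck() {
    			return addr, nil // listener is up; safe to reply Success
    		}
    		// (process-death check elided; it keeps its fast-fail path)
    	}
    	// Deadline exceeded: tear down instead of handing back a dead address.
    	stopAndRecycle()
    	return "", fmt.Errorf("backend at %s not ready within %s. Last stderr:\n%s", addr, timeout, stderrTail())
    }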

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(distributed): route auto-upgrade through BackendManager + bump LocalAGI/LocalRecall

Two distributed-mode bugs that surfaced together in the orchestrator
logs:

1. Auto-upgrade always failed with "backend not found".

   UpgradeChecker correctly routed CheckUpgrades through the active
   BackendManager (so the frontend aggregates worker state), but the
   auto-upgrade branch right below called gallery.UpgradeBackend
   directly with the frontend's SystemState. In distributed mode the
   frontend has no backends installed locally, so ListSystemBackends
   returned empty and Get(name) failed for every reported upgrade.
   Auto-upgrade now also goes through BackendManager.UpgradeBackend,
   which fans out to workers via NATS (sketched after this list).

2. Embedding-load failure on a remote node crashed the orchestrator.

   When RAG init lazily called NewPersistentPostgresCollection and the
   remote embedding worker was unreachable, LocalRecall called
   os.Exit(1) inside the constructor, killing the orchestrator pod.
   LocalRecall now returns errors instead, LocalAGI surfaces them as a
   nil collection, and the existing RAGProviderFromState path returns
   (nil, nil, false) — the same code path the agent pool already takes
   when no RAG is configured. The orchestrator stays up; chat requests
   degrade to "no RAG available" until the embedding worker recovers
   (sketched after this list).
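
For (1), the shape of the routing change (UpgradeChecker plumbing and
exact signatures are assumptions; gallery.UpgradeBackend and
BackendManager.UpgradeBackend are the names involved in this fix):

    package upgrade

    // BackendManager is the same aggregation point CheckUpgrades already
    // uses; in distributed mode it fans requests out to workers over NATS.
    type BackendManager interface {
    	UpgradeBackend(name string) error
    }

    func applyUpgrade(bm BackendManager, name string) error {
    	// Before: gallery.UpgradeBackend(frontendSystemState, name), which
    	// resolved against the frontend's local (empty) backend list and
    	// failed with "backend not found" for every reported upgrade.
    	// After: route through the manager, like the check itself.
    	return bm.UpgradeBackend(name)
    }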
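
For (2), the shape of the LocalRecall change (signature simplified;
probeEmbeddings is a stand-in for reaching the embedding worker):

    package localrecall

    import (
    	"fmt"
    	"os"
    )

    type Collection struct{}

    func probeEmbeddings(url string) error { return nil } // stand-in

    // Old shape: a failed probe exited inside the constructor, taking the
    // whole calling process (here, the orchestrator) down with it.
    func newCollectionOld(embeddingsURL string) *Collection {
    	if err := probeEmbeddings(embeddingsURL); err != nil {
    		fmt.Fprintln(os.Stderr, err)
    		os.Exit(1)
    	}
    	return &Collection{}
    }

    // New shape: the error is returned; LocalAGI maps it to a nil
    // collection and RAGProviderFromState falls back to (nil, nil, false).
    func NewPersistentPostgresCollection(embeddingsURL string) (*Collection, error) {
    	if err := probeEmbeddings(embeddingsURL); err != nil {
    		return nil, fmt.Errorf("embedding backend unreachable: %w", err)
    	}
    	return &Collection{}, nil
    }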

Bumps:
  github.com/mudler/LocalAGI    → e83bf515d010
  github.com/mudler/localrecall → 6138c1f535ab

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>

@@ -465,10 +465,20 @@ func (s *backendSupervisor) startBackend(backend, backendPath string) (string, e
 	bp := s.processes[backend]
 	s.mu.Unlock()
-	// Wait for the gRPC server to be ready
+	// Wait for the gRPC server to be ready before reporting success.
+	// Slow nodes (Jetson Orin doing first-boot CUDA init, large CGO libs)
+	// can take 10-15s before the gRPC port accepts connections; the previous
+	// 4s window made the worker reply Success on a not-yet-listening port,
+	// which manifested upstream as "connect: connection refused" on the
+	// frontend's first LoadModel dial.
 	client := grpc.NewClientWithToken(clientAddr, false, nil, false, s.cmd.RegistrationToken)
-	for range 20 {
-		time.Sleep(200 * time.Millisecond)
+	const (
+		readinessPollInterval = 200 * time.Millisecond
+		readinessTimeout      = 30 * time.Second
+	)
+	deadline := time.Now().Add(readinessTimeout)
+	for time.Now().Before(deadline) {
+		time.Sleep(readinessPollInterval)
 		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
 		if ok, _ := client.HealthCheck(ctx); ok {
 			cancel()
@@ -496,10 +506,23 @@ func (s *backendSupervisor) startBackend(backend, backendPath string) (string, e
 		}
 	}
-	// Log stderr to help diagnose why the backend isn't responding
+	// Readiness deadline exceeded. Returning success here would leave the
+	// frontend with an unbound address (it dials, gets ECONNREFUSED, and
+	// the operator sees a misleading "connection refused" instead of the
+	// real cause). Stop the half-started process, recycle the port, and
+	// surface the failure to the caller with the backend's stderr tail.
 	stderrTail := readLastLinesFromFile(proc.StderrPath(), 20)
-	xlog.Warn("Backend gRPC server not ready after waiting, proceeding anyway", "backend", backend, "addr", clientAddr, "stderr", stderrTail)
-	return clientAddr, nil
+	xlog.Error("Backend gRPC server not ready before deadline; aborting install", "backend", backend, "addr", clientAddr, "timeout", readinessTimeout, "stderr", stderrTail)
+	if killErr := proc.Stop(); killErr != nil {
+		xlog.Warn("Failed to stop unready backend process", "backend", backend, "error", killErr)
+	}
+	s.mu.Lock()
+	if cur, ok := s.processes[backend]; ok && cur == bp {
+		delete(s.processes, backend)
+		s.freePorts = append(s.freePorts, port)
+	}
+	s.mu.Unlock()
+	return "", fmt.Errorf("backend %s did not become ready within %s. Last stderr:\n%s", backend, readinessTimeout, stderrTail)
 }
 // resolveProcessKeys turns a caller-supplied identifier into the set of