mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-09 01:07:09 -04:00
* fix(galleryop): self-evict terminal ops from OpCache.GetStatus

The processingBackends map (the UI 'reinstalling' spinner source) only cleared an op when a client polled /api/backends/job/:uid. The Manage-page Reinstall and Upgrade buttons never poll, so completed installs leaked into processingBackends forever and the backend card spun 'reinstalling' even though the install had finished.

Evict terminal ops on the list read instead; DeleteUUID already broadcasts the eviction so peer replicas converge.

Reproduced on a live 5-node distributed cluster: 5 backends sat in processingBackends with underlying jobs reporting completed:true,progress:100.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(nodes): clear pending backend ops behind offline/draining nodes

ListDuePendingBackendOps filters status=healthy, so a backend op queued against a node that went offline (stale heartbeat) or draining (admin action) was never retried, aged out, or deleted - it leaked forever and kept the UI operation spinning.

Add DeleteStalePendingBackendOps and run it each reconcile pass: draining nodes are cleared immediately (model rows already purged), offline nodes once their heartbeat is older than a grace window (blip protection).

Reproduced on a live cluster: orphaned llama-cpp install rows targeting an offline (nvidia-thor) and a draining (mac-mini-m4) node sat at attempts=0 indefinitely.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(nodes): stream per-node progress during backend upgrade

The install dispatch subscribed to a per-op progress subject and streamed per-node download ticks; the upgrade dispatch did a bare 15-minute blocking NATS round-trip with no subscription, so the UI showed progress:0 the whole time (the 'reinstalling but nothing happens' report on a slow node).

Thread the op ID through BackendManager.UpgradeBackend -> the distributed manager -> the adapter, and have the adapter subscribe to the per-op progress subject before the request (extracted into a shared subscribeProgress helper reused by install/upgrade/force-fallback). The worker's upgradeBackend now creates the same DebouncedInstallProgressPublisher installBackend uses.

An upgrade is a force-reinstall, so it reuses SubjectNodeBackendInstallProgress rather than minting a new subject - no new NATS permission, no new rolling-update compat surface. Reconciler-driven retries pass empty opID/onProgress and stay on the silent path.

Reproduced on a live cluster: upgrade of llama-cpp-development on agx-orin-slow sat at progress:0 for 4+ minutes with no per-node feedback.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(galleryop): persist cancellation + periodically reap orphaned ops

Two distributed gaps surfaced when a replica was killed mid-upgrade on a live cluster, leaving the backend stuck 'processing' in the UI forever:

1. CancelOperation flipped the in-memory status to cancelled and broadcast a NATS event but never persisted the terminal status. On the next replica restart the still-active row re-hydrated straight back into processingBackends and the UI spun again. It now calls store.Cancel(id) so the cancel survives a restart.

2. CleanStale (which marks abandoned active ops failed) only ran once on startup, so an op orphaned AFTER startup - its owning replica's foreground handler goroutine gone - was never reaped until the next restart. Add GalleryService.ReapStaleOperations and run it on a 15m ticker (CleanStale now returns the reaped count for observability).

Neither is covered by the OpCache self-evict fix: an orphaned op never reaches Processed, so it would never self-evict.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(review): address self-review findings on the distributed install fixes

Three findings from an adversarial review of this branch:

1. CRITICAL - OpCache.GetStatus crashed under concurrent load. m.Map() returns the live internal map by reference, so deleting from it on the read path was an unsynchronized write to a map four HTTP handlers poll every ~1s -> a 'concurrent map writes' fatal. Rewritten to iterate a Keys() snapshot, build a fresh result map, and apply evictions via the locked DeleteUUID after the loop. Added a -race concurrency regression guard.

2. HIGH - GetStatus evicted failed ops too, hiding them from /api/operations and breaking the dismiss-failed-op flow (the panel keeps Error != nil ops so the admin can read the error and click Dismiss). Eviction now fires only for terminal ops with Error == nil (success/cancelled); failures are retained.

3. MEDIUM - DeleteStalePendingBackendOps missed StatusUnhealthy nodes. A node marked unhealthy on a NATS ErrNoResponders never transitions to offline (health.go skips re-marking it), so its pending ops leaked exactly like the offline case. Unhealthy is now reaped via the same stale-heartbeat grace path (a fresh-heartbeat node is recovering and keeps its op).

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(review-2): don't evict the still-installing soft-path; don't spin on failed ops

Second review pass found two issues:

1. MEDIUM (Go) - OpCache.GetStatus evicted the ErrWorkerStillInstalling soft-path op. That op is deliberately Processed=true with no error to show a yellow in-progress state when a worker timed out the NATS round-trip but is still installing in the background; the reconciler confirms the real outcome later. Evicting it (and broadcasting OpEnd + marking the DB completed) hid an install that may still fail. Eviction is now scoped to a clean success (progress 100 + 'completed', matching the job-poll's historical condition) or a cancellation - the soft-path (progress != 100) and failures are kept.

2. MEDIUM (React) - the Backends gallery card rendered ANY operation as an 'Installing...' spinner, so a failed op (now intentionally kept in the list for the OperationsBar error + Dismiss) spun forever. Exclude errored ops from the card spinner, mirroring Models.jsx (isInstalling already excludes op.error). The error + Dismiss still surface in the global OperationsBar.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ui): refresh Manage backends table when an operation settles

The Manage backends table fetched installed backends only on mount/after delete and checked upgrades only on tab activation. After a reinstall/upgrade completed neither re-ran, so the installed-version cell and the 'update available' badge stayed stale until the user switched tabs - the op looked like it 'did nothing'.

Watch the operations list (via useOperations) and re-fetch installed backends + available upgrades whenever the count settles, mirroring the operations.length watch Backends.jsx already uses. Consolidates the prior tab-activation upgrades check into the same effect.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
207 lines
8.5 KiB
Go
package distributed

import (
	"fmt"
	"time"

	"github.com/google/uuid"
	"gorm.io/gorm"
	"gorm.io/gorm/clause"
)

// GalleryOperationRecord tracks model/backend download operations in PostgreSQL.
//
// CacheKey and IsBackendOp mirror the in-memory OpCache held by each frontend
// replica. They are written when a request first lands so a freshly-started
// (or freshly-routed-to) replica can rebuild its OpCache from this table
// instead of returning an empty `/api/operations` payload while the real
// operation is still in flight on a peer.
type GalleryOperationRecord struct {
	ID                 string    `gorm:"primaryKey;size:36" json:"id"`
	UserID             string    `gorm:"index;size:36" json:"user_id,omitempty"`
	GalleryElementName string    `gorm:"size:255" json:"gallery_element_name"`
	CacheKey           string    `gorm:"index;size:512" json:"cache_key,omitempty"` // OpCache key (galleryID or node:<id>:<backend>)
	IsBackendOp        bool      `json:"is_backend_op"`                             // true if installed via SetBackend
	OpType             string    `gorm:"size:32" json:"op_type"`                    // "model_install", "model_delete", "backend_install"
	Status             string    `gorm:"size:32;default:pending" json:"status"`     // pending, downloading, processing, completed, failed, cancelled
	Progress           float64   `json:"progress"`                                  // 0.0 to 1.0
	Message            string    `gorm:"type:text" json:"message,omitempty"`
	Error              string    `gorm:"type:text" json:"error,omitempty"`
	FileName           string    `gorm:"size:512" json:"file_name,omitempty"`
	TotalFileSize      string    `gorm:"size:32" json:"total_file_size,omitempty"`
	DownloadedFileSize string    `gorm:"size:32" json:"downloaded_file_size,omitempty"`
	FrontendID         string    `gorm:"size:36" json:"frontend_id,omitempty"` // which instance is processing
	Cancellable        bool      `json:"cancellable"`
	CreatedAt          time.Time `json:"created_at"`
	UpdatedAt          time.Time `json:"updated_at"`
}

// activeStatuses lists the gallery_operations.status values that represent an
// operation a replica should still surface via /api/operations. Hydration and
// the dedup lookup share this set so the two paths never disagree about what
// "still active" means.
var activeStatuses = []string{"pending", "downloading", "processing"}

func (GalleryOperationRecord) TableName() string { return "gallery_operations" }

// GalleryStore manages gallery operation state in PostgreSQL.
type GalleryStore struct {
	db *gorm.DB
}

// NewGalleryStore creates a new GalleryStore and auto-migrates.
func NewGalleryStore(db *gorm.DB) (*GalleryStore, error) {
	if err := db.AutoMigrate(&GalleryOperationRecord{}); err != nil {
		return nil, fmt.Errorf("migrating gallery_operations: %w", err)
	}
	return &GalleryStore{db: db}, nil
}

// Create stores a new gallery operation. Tolerates a row already existing
// for this ID: OpCache.Set may have written a placeholder row via
// UpsertCacheKey before the galleryop service goroutine called Create, and
// in that case we want to fill in the descriptive columns (gallery element
// name, op type, status) rather than fail with a primary-key conflict.
// CacheKey and IsBackendOp are intentionally not in DoUpdates so the
// placeholder's values win.
func (s *GalleryStore) Create(op *GalleryOperationRecord) error {
	if op.ID == "" {
		op.ID = uuid.New().String()
	}
	op.CreatedAt = time.Now()
	op.UpdatedAt = op.CreatedAt
	return s.db.Clauses(clause.OnConflict{
		Columns: []clause.Column{{Name: "id"}},
		DoUpdates: clause.AssignmentColumns([]string{
			"gallery_element_name", "op_type", "status",
			"frontend_id", "user_id", "cancellable", "updated_at",
		}),
	}).Create(op).Error
}

// UpdateProgress updates progress for an operation.
func (s *GalleryStore) UpdateProgress(id string, progress float64, message, downloadedSize string) error {
	return s.db.Model(&GalleryOperationRecord{}).Where("id = ?", id).Updates(map[string]any{
		"progress":             progress,
		"message":              message,
		"downloaded_file_size": downloadedSize,
		"updated_at":           time.Now(),
	}).Error
}

// UpdateStatus updates the status of an operation.
func (s *GalleryStore) UpdateStatus(id, status, errMsg string) error {
	updates := map[string]any{
		"status":     status,
		"updated_at": time.Now(),
	}
	if errMsg != "" {
		updates["error"] = errMsg
	}
	return s.db.Model(&GalleryOperationRecord{}).Where("id = ?", id).Updates(updates).Error
}

// Get retrieves an operation by ID.
func (s *GalleryStore) Get(id string) (*GalleryOperationRecord, error) {
	var op GalleryOperationRecord
	if err := s.db.First(&op, "id = ?", id).Error; err != nil {
		return nil, err
	}
	return &op, nil
}

// List returns all operations, optionally filtered by status.
func (s *GalleryStore) List(status string) ([]GalleryOperationRecord, error) {
	var ops []GalleryOperationRecord
	q := s.db.Order("created_at DESC")
	if status != "" {
		q = q.Where("status = ?", status)
	}
	return ops, q.Find(&ops).Error
}

// ListActive returns operations still considered in-flight, used by replicas
// to rehydrate their in-memory OpCache + statuses on startup. Stale records
// (older than 30 minutes without an update) are excluded so a crashed peer's
// orphaned rows never resurrect on a healthy replica; the existing CleanStale
// reaper eventually marks them failed.
func (s *GalleryStore) ListActive() ([]GalleryOperationRecord, error) {
	var ops []GalleryOperationRecord
	staleCutoff := time.Now().Add(-30 * time.Minute)
	err := s.db.Where("status IN ? AND updated_at > ?", activeStatuses, staleCutoff).
		Order("created_at DESC").Find(&ops).Error
	return ops, err
}

// UpsertCacheKey records the in-memory OpCache key + IsBackendOp flag on the
// gallery_operations row, creating the row if it does not exist yet.
//
// Why upsert: OpCache.Set is called by the HTTP admission handler before the
// galleryop service goroutine processes the operation and calls Create. If
// OpCache wrote with a plain Updates() those columns would silently be lost
// in the window between the two, so peer replicas hydrating in that window
// would still rebuild an empty OpCache. Upsert closes that window.
func (s *GalleryStore) UpsertCacheKey(id, cacheKey string, isBackend bool) error {
	now := time.Now()
	rec := GalleryOperationRecord{
		ID:          id,
		CacheKey:    cacheKey,
		IsBackendOp: isBackend,
		Status:      "pending",
		CreatedAt:   now,
		UpdatedAt:   now,
	}
	return s.db.Clauses(clause.OnConflict{
		Columns: []clause.Column{{Name: "id"}},
		DoUpdates: clause.Assignments(map[string]any{
			"cache_key":     cacheKey,
			"is_backend_op": isBackend,
			"updated_at":    now,
		}),
	}).Create(&rec).Error
}

// FindDuplicate checks if another instance is already downloading the same element.
// Only considers records updated within the last 30 minutes as active; older
// in-progress records are assumed to be stale (crashed instance).
func (s *GalleryStore) FindDuplicate(elementName string) (*GalleryOperationRecord, error) {
	var op GalleryOperationRecord
	staleCutoff := time.Now().Add(-30 * time.Minute)
	err := s.db.Where("gallery_element_name = ? AND status IN ? AND updated_at > ?", elementName,
		activeStatuses, staleCutoff).First(&op).Error
	if err != nil {
		return nil, err
	}
	return &op, nil
}

// Cancel marks an operation as cancelled.
func (s *GalleryStore) Cancel(id string) error {
	return s.UpdateStatus(id, "cancelled", "")
}

// CleanStale marks abandoned in-progress operations as failed and returns the
// number of rows reaped. Called on startup AND periodically to recover from
// crashed/restarted instances that left records in pending/downloading/
// processing state; an op orphaned after startup would otherwise linger
// "processing" until the next restart.
func (s *GalleryStore) CleanStale(age time.Duration) (int64, error) {
	cutoff := time.Now().Add(-age)
	res := s.db.Model(&GalleryOperationRecord{}).
		Where("updated_at < ? AND status IN ?", cutoff, activeStatuses).
		Updates(map[string]any{
			"status":     "failed",
			"error":      "stale operation reaped (abandoned by a crashed or restarted instance)",
			"updated_at": time.Now(),
		})
	return res.RowsAffected, res.Error
}

// CleanOld removes operations older than the given duration.
func (s *GalleryStore) CleanOld(retention time.Duration) error {
	cutoff := time.Now().Add(-retention)
	return s.db.Where("created_at < ? AND status IN ?", cutoff,
		[]string{"completed", "failed", "cancelled"}).
		Delete(&GalleryOperationRecord{}).Error
}