mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-28 10:27:30 -04:00
* feat(distributed): add SyncedMap cross-replica in-memory state component

  Introduce core/services/syncstate.SyncedMap[K,V]: a thread-safe in-memory map that keeps itself consistent across frontend replicas via NATS, with an optional pluggable durable Store and hydrate-from-source convergence.

  Several features keep process-local state surfaced to the API (finetune/quant jobs, agent tasks, model configs) and each hand-wired the same in-memory + NATS broadcast + read-through-store legs - or forgot to, reintroducing cross-replica staleness. SyncedMap makes that consistency a configuration choice:

  - local writes mutate the map, write through the Store, then broadcast a delta;
  - the apply path is memory-only and never re-publishes or re-writes the Store (structural echo-loop guard, mirroring galleryop.mergeStatus);
  - on Start and on NATS reconnect the map re-hydrates from the source (Store, else Loader); an optional periodic Reconcile repairs silent drift;
  - standalone mode (nil NATS client) is a strict in-memory no-op.

  Reconnect re-hydrate is wired via a new *messaging.Client.OnReconnect callback, consumed through an optional type-assertion so MessagingClient stays minimal. Adds messaging.SubjectSyncStateDelta and a reusable testutil.FakeBus (synchronous in-process MessagingClient with wildcard matching) for adopter tests.

  Component only; service migrations follow in subsequent commits.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
  Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* refactor(finetune): back jobs with SyncedMap for cross-replica consistency

  FineTuneService kept jobs in a process-local map and, although it wrote them to Postgres, ListJobs/GetJob never read the store back and the wired natsClient was never used - so in distributed mode a job created on one replica was invisible to the others.

  Replace the map and the dead client with a syncstate.SyncedMap keyed by job ID, value *schema.FineTuneJob (the exact REST shape, so responses are unchanged).

  - Add a Store adapter (core/services/finetune/syncstore.go) over FineTuneStore, plus FineTuneStore.ListAll (global hydrate; per-user List kept) and an idempotent Upsert (create-or-update; Create alone fails on dup key).
  - Writes go through SyncedMap.Set/Delete (write-through + broadcast); reads use List/Get. The on-disk state.json path becomes the standalone Loader, keeping single-node restart recovery (stale->stopped / exporting->failed fixups).
  - Fold SetNATSClient/SetFineTuneStore into NewFineTuneService; app.go passes the distributed NATS client + store when distributed, nil otherwise.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
  Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* refactor(agentpool): back agent tasks with SyncedMap for cross-replica consistency

  AgentJobService.ListTasks read the process-local tasks map only, while ListJobs already read through the DB persister + dispatcher NATS - so in distributed mode a task created on one replica was invisible to the others.

  Back tasks with a syncstate.SyncedMap keyed by task ID (value schema.Task, the exact REST shape); jobs are left untouched.

  - Store adapter (task_syncstore.go) over the existing JobPersister (LoadTasks/SaveTask/DeleteTask); reads svc.persister/userID live so a persister swap needs no rebuild. No new persister methods required.
  - Task reads -> SyncedMap.List/Get; create/update -> Set (write-through + broadcast); delete -> Delete. The file persister now owns its own task set so the write-through path does not re-enter the SyncedMap lock (deadlock guard).
  - The distributed NATS client is not available at construction (start() precedes initDistributed), so it is injected via SetTaskSyncNATS, which rebuilds the still-empty map before Start/hydrate. Wired at the main, restart, and per-user (UserServicesManager) distributed sites.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
  Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* refactor(quantization): back jobs with SyncedMap + durable QuantStore

  QuantizationService kept jobs in a process-local map persisted only to a local state.json, so in distributed mode jobs were neither visible across replicas nor durable cluster-wide.

  Back jobs with a syncstate.SyncedMap keyed by job ID (value *schema.QuantizationJob, the exact REST shape).

  - New distributed.QuantStore (GORM, table quantization_jobs) mirroring FineTuneStore: Create/Get/ListAll/Upsert(idempotent)/Delete, registered for AutoMigrate via distributed.InitStores (Stores.Quant).
  - New adapter (quantization/syncstore.go) over QuantStore implementing syncstate.Store, with record<->schema conversion.
  - Reads go through List/Get, writes through Set/Delete (write-through + broadcast); state.json is kept as the standalone Loader for single-node restart recovery (stale-job fixups preserved).
  - app.go passes the distributed NATS client + QuantStore when distributed, nil otherwise; Start/Close lifecycle mirrors finetune.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
  Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(syncstate): annotate gosec G118 false positive on lifeCtx

  gosec flagged the WithCancel in Start as "cancellation function not called" because the returned cancel is stored on the struct rather than called/deferred in scope. It is invoked in Close (covered by tests), and lifeCtx must outlive Start to drive the reconnect/reconcile goroutines. Suppress the verified false positive with a justified #nosec G118.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
  Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* test(distributed): e2e two-replica SyncedMap sync over real NATS + Postgres

  Adds the real-infrastructure counterpart to the fake-bus unit tests, in the existing distributed e2e suite (testcontainers NATS + PostgreSQL).

  Two SyncedMap instances stand in for two frontend replicas - each with its OWN NATS connection to a shared server and a SHARED Postgres store (the distributed-mode invariant) - and assert, over the wire:

  - a create on replica A is observed by replica B;
  - an update and a delete propagate A -> B (delete prunes, which a reload cannot);
  - a late-joining replica recovers a job it never received a delta for, via store hydrate on Start (the at-most-once gap a fake bus cannot exercise);
  - a local Set is written through to the shared Postgres store.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
  Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
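The write-through, broadcast, and memory-only apply legs described in the first commit can be illustrated with a toy model. This is only a sketch under stated assumptions: `delta`, `bus`, `store`, and `replica` are hypothetical names invented for illustration, the bus is a synchronous in-process stand-in rather than NATS, and nothing here is the actual `syncstate.SyncedMap` API. Locking is omitted because everything runs on one goroutine.

```go
package main

import "fmt"

// delta is a hypothetical wire message: one key changed or was deleted.
type delta struct {
	Key     string
	Value   string
	Deleted bool
}

// bus is a toy synchronous stand-in for a broadcast subject (not NATS).
type bus struct {
	subs      []func(delta)
	published int
}

func (b *bus) subscribe(fn func(delta)) { b.subs = append(b.subs, fn) }

// publish delivers to every subscriber, including the sender itself.
func (b *bus) publish(d delta) {
	b.published++
	for _, fn := range b.subs {
		fn(d)
	}
}

// store is a toy shared durable store that counts its writes.
type store struct {
	rows   map[string]string
	writes int
}

func (s *store) upsert(k, v string) { s.rows[k] = v; s.writes++ }
func (s *store) remove(k string)    { delete(s.rows, k); s.writes++ }

// replica models one frontend: an in-memory map plus the shared store and bus.
type replica struct {
	mem map[string]string
	st  *store
	b   *bus
}

// newReplica hydrates from the store first, then starts listening for deltas.
func newReplica(st *store, b *bus) *replica {
	r := &replica{mem: map[string]string{}, st: st, b: b}
	for k, v := range st.rows {
		r.mem[k] = v
	}
	b.subscribe(r.apply)
	return r
}

// Set is the local write path: mutate memory, write through, broadcast.
func (r *replica) Set(k, v string) {
	r.mem[k] = v
	r.st.upsert(k, v)
	r.b.publish(delta{Key: k, Value: v})
}

// Delete is the local delete path, with the same three legs as Set.
func (r *replica) Delete(k string) {
	delete(r.mem, k)
	r.st.remove(k)
	r.b.publish(delta{Key: k, Deleted: true})
}

// apply is the remote path: memory only, so it never writes the store or
// publishes again. That is what keeps an echoed delta from looping.
func (r *replica) apply(d delta) {
	if d.Deleted {
		delete(r.mem, d.Key)
		return
	}
	r.mem[d.Key] = d.Value
}

func main() {
	st := &store{rows: map[string]string{}}
	b := &bus{}
	a, other := newReplica(st, b), newReplica(st, b)

	a.Set("job-1", "queued")
	fmt.Println("other replica sees:", other.mem["job-1"])
	fmt.Println("store writes:", st.writes, "publishes:", b.published)

	late := newReplica(st, b) // joined after the delta was sent
	fmt.Println("late joiner sees:", late.mem["job-1"])
}
```

In this toy the sender also receives its own delta, yet the store is written once and the bus is published to once per `Set`, because `apply` has no path back to either.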
135 lines
5.0 KiB
Go
package distributed

import (
	"context"
	"fmt"
	"time"

	"github.com/google/uuid"
	"github.com/mudler/LocalAI/core/services/advisorylock"
	"gorm.io/gorm"
	"gorm.io/gorm/clause"
)

// FineTuneJobRecord tracks fine-tune jobs in PostgreSQL.
type FineTuneJobRecord struct {
	ID              string    `gorm:"primaryKey;size:36" json:"id"`
	UserID          string    `gorm:"index;size:36" json:"user_id,omitempty"`
	Model           string    `gorm:"size:255" json:"model"`
	Backend         string    `gorm:"size:64" json:"backend"`
	ModelID         string    `gorm:"size:255" json:"model_id,omitempty"`
	TrainingType    string    `gorm:"size:32" json:"training_type"` // lora, loha, lokr, full
	TrainingMethod  string    `gorm:"size:32" json:"training_method"` // sft, dpo, grpo, etc.
	Status          string    `gorm:"index;size:32;default:queued" json:"status"` // queued, loading_model, training, saving, completed, failed, stopped
	Message         string    `gorm:"type:text" json:"message,omitempty"`
	OutputDir       string    `gorm:"size:512" json:"output_dir,omitempty"`
	ConfigJSON      string    `gorm:"column:config;type:text" json:"-"`
	ExtraOptsJSON   string    `gorm:"column:extra_options;type:text" json:"-"`
	BackendNodeID   string    `gorm:"size:36" json:"backend_node_id,omitempty"` // which backend node runs it
	ExportStatus    string    `gorm:"size:32" json:"export_status,omitempty"`
	ExportMessage   string    `gorm:"type:text" json:"export_message,omitempty"`
	ExportModelName string    `gorm:"size:255" json:"export_model_name,omitempty"`
	CreatedAt       time.Time `json:"created_at"`
	UpdatedAt       time.Time `json:"updated_at"`
}

func (FineTuneJobRecord) TableName() string { return "finetune_jobs" }

// FineTuneStore manages fine-tune job state in PostgreSQL.
type FineTuneStore struct {
	db *gorm.DB
}

// NewFineTuneStore creates a new FineTuneStore and auto-migrates.
// Uses a PostgreSQL advisory lock to prevent concurrent migration races
// when multiple instances (frontend + workers) start at the same time.
func NewFineTuneStore(db *gorm.DB) (*FineTuneStore, error) {
	if err := advisorylock.WithLockCtx(context.Background(), db, advisorylock.KeySchemaMigrate, func() error {
		return db.AutoMigrate(&FineTuneJobRecord{})
	}); err != nil {
		return nil, fmt.Errorf("migrating finetune_jobs: %w", err)
	}
	return &FineTuneStore{db: db}, nil
}

// Create stores a new fine-tune job.
func (s *FineTuneStore) Create(job *FineTuneJobRecord) error {
	if job.ID == "" {
		job.ID = uuid.New().String()
	}
	job.CreatedAt = time.Now()
	job.UpdatedAt = job.CreatedAt
	return s.db.Create(job).Error
}

// Get retrieves a fine-tune job by ID.
func (s *FineTuneStore) Get(id string) (*FineTuneJobRecord, error) {
	var job FineTuneJobRecord
	if err := s.db.First(&job, "id = ?", id).Error; err != nil {
		return nil, err
	}
	return &job, nil
}

// List returns fine-tune jobs, optionally filtered by user.
func (s *FineTuneStore) List(userID string) ([]FineTuneJobRecord, error) {
	var jobs []FineTuneJobRecord
	q := s.db.Order("created_at DESC")
	if userID != "" {
		q = q.Where("user_id = ?", userID)
	}
	return jobs, q.Find(&jobs).Error
}

// ListAll returns every fine-tune job across all users. The SyncedMap that backs
// FineTuneService is a single global map (the REST API filters by user at read
// time), so hydrate needs the full set rather than the per-user List above.
func (s *FineTuneStore) ListAll() ([]FineTuneJobRecord, error) {
	var jobs []FineTuneJobRecord
	return jobs, s.db.Order("created_at DESC").Find(&jobs).Error
}

// Upsert idempotently inserts or fully replaces a job row by primary key. The
// SyncedMap write-through path issues a single Set per mutation regardless of
// whether the job already exists, so it needs one create-or-update primitive
// (Create alone fails on a duplicate key, UpdateStatus alone misses new rows and
// only touches a few columns).
func (s *FineTuneStore) Upsert(job *FineTuneJobRecord) error {
	if job.ID == "" {
		job.ID = uuid.New().String()
	}
	now := time.Now()
	if job.CreatedAt.IsZero() {
		job.CreatedAt = now
	}
	job.UpdatedAt = now
	return s.db.Clauses(clause.OnConflict{
		Columns:   []clause.Column{{Name: "id"}},
		UpdateAll: true,
	}).Create(job).Error
}

// UpdateStatus updates the status and message of a fine-tune job.
func (s *FineTuneStore) UpdateStatus(id, status, message string) error {
	return s.db.Model(&FineTuneJobRecord{}).Where("id = ?", id).Updates(map[string]any{
		"status":     status,
		"message":    message,
		"updated_at": time.Now(),
	}).Error
}

// UpdateExportStatus updates the export state of a fine-tune job.
func (s *FineTuneStore) UpdateExportStatus(id, status, message, modelName string) error {
	return s.db.Model(&FineTuneJobRecord{}).Where("id = ?", id).Updates(map[string]any{
		"export_status":     status,
		"export_message":    message,
		"export_model_name": modelName,
		"updated_at":        time.Now(),
	}).Error
}

// Delete removes a fine-tune job.
func (s *FineTuneStore) Delete(id string) error {
	return s.db.Where("id = ?", id).Delete(&FineTuneJobRecord{}).Error
}