mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-27 18:06:58 -04:00
* feat(distributed): add SyncedMap cross-replica in-memory state component

Introduce core/services/syncstate.SyncedMap[K,V]: a thread-safe in-memory map that keeps itself consistent across frontend replicas via NATS, with an optional pluggable durable Store and hydrate-from-source convergence.

Several features keep process-local state surfaced to the API (finetune/quant jobs, agent tasks, model configs) and each hand-wired the same in-memory + NATS broadcast + read-through-store legs - or forgot to, reintroducing cross-replica staleness. SyncedMap makes that consistency a configuration choice:

- local writes mutate the map, write through the Store, then broadcast a delta;
- the apply path is memory-only and never re-publishes or re-writes the Store (structural echo-loop guard, mirroring galleryop.mergeStatus);
- on Start and on NATS reconnect the map re-hydrates from the source (Store, else Loader); an optional periodic Reconcile repairs silent drift;
- standalone mode (nil NATS client) is a strict in-memory no-op.

Reconnect re-hydrate is wired via a new *messaging.Client.OnReconnect callback, consumed through an optional type-assertion so MessagingClient stays minimal. Adds messaging.SubjectSyncStateDelta and a reusable testutil.FakeBus (synchronous in-process MessagingClient with wildcard matching) for adopter tests.

Component only; service migrations follow in subsequent commits.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* refactor(finetune): back jobs with SyncedMap for cross-replica consistency

FineTuneService kept jobs in a process-local map and, although it wrote them to Postgres, ListJobs/GetJob never read the store back and the wired natsClient was never used - so in distributed mode a job created on one replica was invisible to the others. Replace the map and the dead client with a syncstate.SyncedMap keyed by job ID, value *schema.FineTuneJob (the exact REST shape, so responses are unchanged).

- Add a Store adapter (core/services/finetune/syncstore.go) over FineTuneStore, plus FineTuneStore.ListAll (global hydrate; per-user List kept) and an idempotent Upsert (create-or-update; Create alone fails on dup key).
- Writes go through SyncedMap.Set/Delete (write-through + broadcast); reads use List/Get. The on-disk state.json path becomes the standalone Loader, keeping single-node restart recovery (stale->stopped / exporting->failed fixups).
- Fold SetNATSClient/SetFineTuneStore into NewFineTuneService; app.go passes the distributed NATS client + store when distributed, nil otherwise.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* refactor(agentpool): back agent tasks with SyncedMap for cross-replica consistency

AgentJobService.ListTasks read the process-local tasks map only, while ListJobs already read through the DB persister + dispatcher NATS - so in distributed mode a task created on one replica was invisible to the others. Back tasks with a syncstate.SyncedMap keyed by task ID (value schema.Task, the exact REST shape); jobs are left untouched.

- Store adapter (task_syncstore.go) over the existing JobPersister (LoadTasks/SaveTask/DeleteTask); reads svc.persister/userID live so a persister swap needs no rebuild. No new persister methods required.
- Task reads -> SyncedMap.List/Get; create/update -> Set (write-through + broadcast); delete -> Delete. The file persister now owns its own task set so the write-through path does not re-enter the SyncedMap lock (deadlock guard).
- The distributed NATS client is not available at construction (start() precedes initDistributed), so it is injected via SetTaskSyncNATS, which rebuilds the still-empty map before Start/hydrate. Wired at the main, restart, and per-user (UserServicesManager) distributed sites.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* refactor(quantization): back jobs with SyncedMap + durable QuantStore

QuantizationService kept jobs in a process-local map persisted only to a local state.json, so in distributed mode jobs were neither visible across replicas nor durable cluster-wide. Back jobs with a syncstate.SyncedMap keyed by job ID (value *schema.QuantizationJob, the exact REST shape).

- New distributed.QuantStore (GORM, table quantization_jobs) mirroring FineTuneStore: Create/Get/ListAll/Upsert(idempotent)/Delete, registered for AutoMigrate via distributed.InitStores (Stores.Quant).
- New adapter (quantization/syncstore.go) over QuantStore implementing syncstate.Store, with record<->schema conversion.
- Reads go through List/Get, writes through Set/Delete (write-through + broadcast); state.json is kept as the standalone Loader for single-node restart recovery (stale-job fixups preserved).
- app.go passes the distributed NATS client + QuantStore when distributed, nil otherwise; Start/Close lifecycle mirrors finetune.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(syncstate): annotate gosec G118 false positive on lifeCtx

gosec flagged the WithCancel in Start as "cancellation function not called" because the returned cancel is stored on the struct rather than called/deferred in scope. It is invoked in Close (covered by tests), and lifeCtx must outlive Start to drive the reconnect/reconcile goroutines. Suppress the verified false positive with a justified #nosec G118.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* test(distributed): e2e two-replica SyncedMap sync over real NATS + Postgres

Adds the real-infrastructure counterpart to the fake-bus unit tests, in the existing distributed e2e suite (testcontainers NATS + PostgreSQL). Two SyncedMap instances stand in for two frontend replicas - each with its OWN NATS connection to a shared server and a SHARED Postgres store (the distributed-mode invariant) - and assert, over the wire:

- a create on replica A is observed by replica B;
- an update and a delete propagate A -> B (delete prunes, which a reload cannot);
- a late-joining replica recovers a job it never received a delta for, via store hydrate on Start (the at-most-once gap a fake bus cannot exercise);
- a local Set is written through to the shared Postgres store.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
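
The write path, the memory-only apply path, and hydrate-on-Start described in the commit message can be illustrated with a toy model. This is a minimal sketch, not the syncstate implementation: `Store`, `Bus`, `Replica`, and every method on them are hypothetical names invented for illustration, the bus is a synchronous in-process stand-in rather than NATS, and the hydrate-then-subscribe ordering in `Start` is an assumed simplification.

```go
package main

import "sync"

// Store is a hypothetical stand-in for the shared durable store.
// It counts writes so the echo-loop guard can be observed.
type Store struct {
	mu     sync.Mutex
	rows   map[string]string
	writes int
}

func NewStore() *Store { return &Store{rows: map[string]string{}} }

func (s *Store) Upsert(k, v string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.rows[k] = v
	s.writes++
}

func (s *Store) Delete(k string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.rows, k)
	s.writes++
}

// Snapshot returns a copy of every row; replicas use it to hydrate.
func (s *Store) Snapshot() map[string]string {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make(map[string]string, len(s.rows))
	for k, v := range s.rows {
		out[k] = v
	}
	return out
}

// delta is the change broadcast to peers.
type delta struct {
	key, value string
	deleted    bool
}

// Bus is a synchronous in-process stand-in for the message bus. It delivers
// each delta to every subscriber, the publisher included.
type Bus struct {
	mu        sync.Mutex
	subs      []*Replica
	published int
}

func (b *Bus) Subscribe(r *Replica) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.subs = append(b.subs, r)
}

func (b *Bus) Publish(d delta) {
	b.mu.Lock()
	b.published++
	subs := append([]*Replica(nil), b.subs...) // copy so apply runs unlocked
	b.mu.Unlock()
	for _, r := range subs {
		r.apply(d)
	}
}

// Replica is a toy version of one frontend's synced map.
type Replica struct {
	mu    sync.RWMutex
	data  map[string]string
	store *Store
	bus   *Bus
}

func NewReplica(s *Store, b *Bus) *Replica {
	return &Replica{data: map[string]string{}, store: s, bus: b}
}

// Start hydrates from the shared store, then subscribes to deltas.
func (r *Replica) Start() {
	snap := r.store.Snapshot()
	r.mu.Lock()
	r.data = snap
	r.mu.Unlock()
	r.bus.Subscribe(r)
}

// Set is the local write path: memory, then store, then broadcast.
func (r *Replica) Set(k, v string) {
	r.mu.Lock()
	r.data[k] = v
	r.mu.Unlock()
	r.store.Upsert(k, v)
	r.bus.Publish(delta{key: k, value: v})
}

// Delete mirrors Set for removals.
func (r *Replica) Delete(k string) {
	r.mu.Lock()
	delete(r.data, k)
	r.mu.Unlock()
	r.store.Delete(k)
	r.bus.Publish(delta{key: k, deleted: true})
}

// apply is the remote path: it touches memory only and never writes the
// store or publishes, so a received delta cannot echo back onto the bus.
func (r *Replica) apply(d delta) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if d.deleted {
		delete(r.data, d.key)
		return
	}
	r.data[d.key] = d.value
}

func (r *Replica) Get(k string) (string, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	v, ok := r.data[k]
	return v, ok
}
```

In this model, two started replicas sharing one `Store` and one `Bus` see each other's `Set` and `Delete` calls, each local write produces exactly one store write and one publish even though the publisher receives its own delta, and a replica started after a write still finds the value because `Start` reads the store before subscribing.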
162 lines
5.5 KiB
Go
package distributed_test

import (
	"context"

	"github.com/mudler/LocalAI/core/services/distributed"
	"github.com/mudler/LocalAI/core/services/messaging"
	"github.com/mudler/LocalAI/core/services/syncstate"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"

	pgdriver "gorm.io/driver/postgres"
	"gorm.io/gorm"
	"gorm.io/gorm/logger"
)

// ftSyncStore adapts the real FineTuneStore to syncstate.Store, exactly as the
// finetune service does in production. Defined here (rather than reusing the
// service's unexported adapter) so the e2e exercises the store + component over
// real infrastructure without pulling in backend execution.
type ftSyncStore struct{ s *distributed.FineTuneStore }

func (a ftSyncStore) List(_ context.Context) ([]*distributed.FineTuneJobRecord, error) {
	recs, err := a.s.ListAll()
	if err != nil {
		return nil, err
	}
	out := make([]*distributed.FineTuneJobRecord, len(recs))
	for i := range recs {
		r := recs[i]
		out[i] = &r
	}
	return out, nil
}

func (a ftSyncStore) Upsert(_ context.Context, r *distributed.FineTuneJobRecord) error {
	return a.s.Upsert(r)
}

func (a ftSyncStore) Delete(_ context.Context, k string) error { return a.s.Delete(k) }

// This suite is the real-infrastructure counterpart to the fake-bus unit tests:
// two SyncedMap instances stand in for two LocalAI frontend replicas, each with
// its OWN NATS connection to a shared NATS server and a SHARED PostgreSQL store -
// the exact distributed-mode invariant (single shared DB, per-replica process
// state). It proves the delta path works over the wire and that a late-joining
// replica recovers via store hydrate (the at-most-once gap a fake bus cannot
// exercise).
var _ = Describe("SyncedMap two-replica sync over real NATS", Label("Distributed"), func() {
	var (
		infra   *TestInfra
		ftStore *distributed.FineTuneStore
	)

	BeforeEach(func() {
		infra = SetupInfra("localai_syncstate_dist_test")

		db, err := gorm.Open(pgdriver.Open(infra.PGURL), &gorm.Config{
			Logger: logger.Default.LogMode(logger.Silent),
		})
		Expect(err).ToNot(HaveOccurred())

		ftStore, err = distributed.NewFineTuneStore(db)
		Expect(err).ToNot(HaveOccurred())
	})

	// newReplica builds an independent "replica": its own NATS client to the
	// shared server plus a SyncedMap over the shared store, started (hydrate +
	// subscribe) and cleaned up automatically.
	newReplica := func() *syncstate.SyncedMap[string, *distributed.FineTuneJobRecord] {
		GinkgoHelper()
		nc, err := messaging.New(infra.NatsURL)
		Expect(err).ToNot(HaveOccurred())

		sm := syncstate.New(syncstate.Config[string, *distributed.FineTuneJobRecord]{
			Name:  "finetune.jobs",
			Key:   func(r *distributed.FineTuneJobRecord) string { return r.ID },
			Nats:  nc,
			Store: ftSyncStore{s: ftStore},
		})
		Expect(sm.Start(infra.Ctx)).To(Succeed())
		FlushNATS(nc) // ensure the subscription is registered server-side before any publish
		DeferCleanup(func() {
			_ = sm.Close()
			nc.Close()
		})
		return sm
	}

	rec := func(id, status string) *distributed.FineTuneJobRecord {
		return &distributed.FineTuneJobRecord{
			ID: id, UserID: "u1", Model: "m", Backend: "b",
			TrainingType: "lora", TrainingMethod: "sft", Status: status,
		}
	}

	It("propagates a create from replica A to replica B over the wire", func() {
		a := newReplica()
		b := newReplica()

		Expect(a.Set(infra.Ctx, rec("job-1", "queued"))).To(Succeed())

		Eventually(func() bool { _, ok := b.Get("job-1"); return ok }, "10s", "50ms").
			Should(BeTrue(), "replica B must observe the job created on A via NATS")

		got, ok := b.Get("job-1")
		Expect(ok).To(BeTrue())
		Expect(got.Status).To(Equal("queued"))
	})

	It("propagates an update and a delete across replicas", func() {
		a := newReplica()
		b := newReplica()

		Expect(a.Set(infra.Ctx, rec("job-2", "queued"))).To(Succeed())
		Eventually(func() bool { _, ok := b.Get("job-2"); return ok }, "10s", "50ms").Should(BeTrue())

		// Update on A -> B reflects the new status.
		Expect(a.Set(infra.Ctx, rec("job-2", "training"))).To(Succeed())
		Eventually(func() string {
			if r, ok := b.Get("job-2"); ok {
				return r.Status
			}
			return ""
		}, "10s", "50ms").Should(Equal("training"))

		// Delete on A -> B prunes (a reload-from-path could not do this).
		Expect(a.Delete(infra.Ctx, "job-2")).To(Succeed())
		Eventually(func() bool { _, ok := b.Get("job-2"); return ok }, "10s", "50ms").
			Should(BeFalse(), "replica B must drop the job deleted on A")
	})

	It("hydrates a late-joining replica from the shared store (missed-delta recovery)", func() {
		a := newReplica()

		// Written (and broadcast) BEFORE replica C exists, so C can never receive
		// the delta - it can only learn the job by hydrating from shared Postgres
		// on Start. This is the at-most-once gap a fake bus cannot exercise.
		Expect(a.Set(infra.Ctx, rec("job-3", "completed"))).To(Succeed())
		Eventually(func() (*distributed.FineTuneJobRecord, error) { return ftStore.Get("job-3") }, "10s", "50ms").
			ShouldNot(BeNil(), "write-through must reach the shared store first")

		c := newReplica() // joins late; Start() hydrates from the store synchronously

		got, ok := c.Get("job-3")
		Expect(ok).To(BeTrue(), "late replica must recover the job via store hydrate, not a delta")
		Expect(got.Status).To(Equal("completed"))
	})

	It("write-through persists a local Set to the shared PostgreSQL store", func() {
		a := newReplica()

		Expect(a.Set(infra.Ctx, rec("job-4", "queued"))).To(Succeed())

		persisted, err := ftStore.Get("job-4")
		Expect(err).ToNot(HaveOccurred())
		Expect(persisted.ID).To(Equal("job-4"))
		Expect(persisted.Status).To(Equal("queued"))
	})
})