fix(distributed): don't let a dead worker pin the model-load advisory lock (#10600)

* fix(distributed): don't let a dead worker pin the model-load advisory lock

In distributed mode a chat request could fail with:

    failed to route model with internal loader: routing model ...: loading model ...: advisorylock: acquiring lock <id>: ERROR: canceling statement due to lock timeout (SQLSTATE 55P03)

Root cause is two independent defects in the cross-replica model-load path:

1. SmartRouter.Route holds a per-model PostgreSQL advisory lock for the whole cold-load sequence, which includes installBackendOnNode -> InstallBackend, a NATS request-reply with a 15m deadline (DefaultBackendInstallTimeout) that ignored ctx. When the chosen worker died mid-install, the holder sat on the lock for up to 15m. The detached loadCtx (WithoutCancel) had no deadline, so nothing capped the hold.
2. The acquiring statement, pg_advisory_lock(), is subject to any deployment global lock_timeout. A common operator setting (e.g. 10s) aborts the wait with SQLSTATE 55P03, so every other replica's request for that model hard-errored instead of waiting for the in-progress load and reusing it. For the ~15m window the model was effectively unroutable.

Fixes:

- advisorylock.WithLockCtx (postgres): SET lock_timeout = 0 on its dedicated connection (RESET before it returns to the pool) so the Go context, not a deployment-wide GUC, governs how long we wait. Waiters now block and then re-check, reusing the model another replica just loaded.
- SmartRouter: bound the detached loadCtx with a single ModelLoadCeiling so the lock is always released in bounded time even if a sub-step wedges. Default is the configured backend.install deadline + 10m (staging + LoadModel margin), so a legitimately slow load is never cut.
- installBackendOnNode: use singleflight.DoChan + select on ctx.Done() so the install wait honors cancellation; the ceiling can then actually free a caller pinned behind a dead worker. The shared install still coalesces via singleflight.

Reproduced both defects as failing tests first (a real 55P03 against a testcontainer with a short lock_timeout; a wedged install that blocks Route) and confirmed green.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(distributed): bound advisory-lock wait instead of disabling lock_timeout

Setting lock_timeout = 0 to override a deployment's short global lock_timeout meant "wait forever" server-side. Safe for SmartRouter.Route (its loadCtx now carries the model-load ceiling) but unsafe for the schema-migration callers that pass context.Background(): a holder whose session never releases would hang them indefinitely.

Derive the server-side lock_timeout from the caller's context instead: its remaining budget plus a margin (so the Go context's cancellation still wins with a clean error and the server bound is only a backstop), or a finite 30m backstop when the context has no deadline. Never zero - "wait forever" is no longer possible, while a deployment's hostile short lock_timeout is still overridden so legitimate cross-replica waits don't fail with 55P03.

Added a spec proving a deadline-less waiter gives up at the (shrunk) backstop rather than hanging.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
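
The context-derived bound described in the second commit message can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the committed code: `budgetFrom`, `lockWaitBudget`, and `setLockTimeoutSQL` are hypothetical names, the constants mirror the 30m backstop quoted in the message and the 30s margin and 1s floor visible in the diff, and no database connection is involved.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Illustrative constants; the names are assumptions made for this sketch.
const (
	backstop = 30 * time.Minute // used when the context has no deadline
	margin   = 30 * time.Second // lets the Go context fire before the server bound
	floor    = time.Second      // never hand PostgreSQL a zero ("wait forever") value
)

// budgetFrom maps a context's remaining time to a server-side lock_timeout.
func budgetFrom(remaining time.Duration, hasDeadline bool) time.Duration {
	if !hasDeadline {
		return backstop
	}
	if b := remaining + margin; b > floor {
		return b
	}
	return floor
}

// lockWaitBudget derives the bound from the caller's context.
func lockWaitBudget(ctx context.Context) time.Duration {
	dl, ok := ctx.Deadline()
	if !ok {
		return budgetFrom(0, false)
	}
	return budgetFrom(time.Until(dl), true)
}

// setLockTimeoutSQL renders the statement. PostgreSQL reads a bare integer
// lock_timeout value as milliseconds.
func setLockTimeoutSQL(d time.Duration) string {
	return fmt.Sprintf("SET lock_timeout = %d", d.Milliseconds())
}

func main() {
	// No deadline: the finite backstop applies.
	fmt.Println(setLockTimeoutSQL(lockWaitBudget(context.Background())))

	// With a deadline: remaining time plus the margin.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	fmt.Println(lockWaitBudget(ctx).Round(time.Second))
}
```

Keeping the server-side value strictly larger than the context's remaining time means the waiter normally sees a context error rather than SQLSTATE 55P03; the server bound only matters when the context has no deadline or its cancellation does not reach the server.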
chore: ⬆️ Update CrispStrobe/CrispASR to fcbc8718e654995e3bd2d0c98bcb8e55e297d23c (#10634)
2026-07-02 04:16:56 -04:00 · 2026-07-02 09:52:51 +02:00 · 2026-07-02 09:48:20 +02:00 · 2026-07-02 09:48:08 +02:00 · 2026-07-02 09:47:54 +02:00 · 2026-07-02 09:47:41 +02:00
23 changed files with 327 additions and 30 deletions
--- a/11
+++ b/11
@@ -171,6 +171,17 @@ RUN if [ "${BUILD_TYPE}" = "hipblas" ]; then \
    ln -s /opt/rocm-**/lib/llvm/lib/libomp.so /usr/lib/libomp.so \
    ; fi

+# ROCm's bundled libdrm_amdgpu is built with a hardcoded fallback lookup path
+# for the ASIC ID table (/opt/amdgpu/share/libdrm/amdgpu.ids), which only exists
+# if AMD's full amdgpu graphics/DKMS stack is installed. This compute-only image
+# doesn't have it, so hipblas/rocBLAS log "No such file or directory" on every
+# model load and can fail to identify the GPU. Point it at the equivalent file
+# Ubuntu's libdrm-common package already ships.
+RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ -f /usr/share/libdrm/amdgpu.ids ] && [ ! -e /opt/amdgpu/share/libdrm/amdgpu.ids ]; then \
+    mkdir -p /opt/amdgpu/share/libdrm && \
+    ln -s /usr/share/libdrm/amdgpu.ids /opt/amdgpu/share/libdrm/amdgpu.ids \
+    ; fi
+
 RUN expr "${BUILD_TYPE}" = intel && echo "intel" > /run/localai/capability || echo "not intel"

 # Cuda
--- a/backend/cpp/ik-llama-cpp/Makefile
+++ b/backend/cpp/ik-llama-cpp/Makefile
@@ -1,5 +1,5 @@

-IK_LLAMA_VERSION?=f74a6fb87b315b2c3154166e075360e15021a61d
+IK_LLAMA_VERSION?=068b173649f2fd8dc96b35ada5a0b76d8985105d
 LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp

 CMAKE_ARGS?=
--- a/backend/cpp/llama-cpp/Makefile
+++ b/backend/cpp/llama-cpp/Makefile
@@ -1,5 +1,5 @@

-LLAMA_VERSION?=6f4f53f2b7da54fcdbbecaaa734337c337ad6176
+LLAMA_VERSION?=4fc4ec5541b243957ae5099edb67372f8f3b550e
 LLAMA_REPO?=https://github.com/ggerganov/llama.cpp

 CMAKE_ARGS?=
--- a/backend/go/crispasr/Makefile
+++ b/backend/go/crispasr/Makefile
@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)

 # CrispASR version (release tag)
 CRISPASR_REPO?=https://github.com/CrispStrobe/CrispASR
-CRISPASR_VERSION?=3b93758f9725d400eca82976f895e4cec3f31260
+CRISPASR_VERSION?=fcbc8718e654995e3bd2d0c98bcb8e55e297d23c
 SO_TARGET?=libgocrispasr.so

 CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF
--- a/backend/go/parakeet-cpp/Makefile
+++ b/backend/go/parakeet-cpp/Makefile
@@ -1,6 +1,6 @@
 # parakeet-cpp backend Makefile.
 #
-# Upstream pin lives below as PARAKEET_VERSION?=f469a57270a1cc4554acb15febf60e56619673b9
+# Upstream pin lives below as PARAKEET_VERSION?=e8acc6172a94e20a952cf1843decace5d771a94b
 # (.github/bump_deps.sh) can find and update it - matches the
 # whisper.cpp / ds4 / vibevoice-cpp convention.
 #
@@ -15,7 +15,7 @@
 # That's what the L0 smoke test uses. The default target below does the
 # proper clone-at-pin + cmake build so CI doesn't need a side-checkout.

-PARAKEET_VERSION?=f469a57270a1cc4554acb15febf60e56619673b9
+PARAKEET_VERSION?=e8acc6172a94e20a952cf1843decace5d771a94b
 PARAKEET_REPO?=https://github.com/mudler/parakeet.cpp

 GOCMD?=go
--- a/backend/go/stablediffusion-ggml/Makefile
+++ b/backend/go/stablediffusion-ggml/Makefile
@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)

 # stablediffusion.cpp (ggml)
 STABLEDIFFUSION_GGML_REPO?=https://github.com/leejet/stable-diffusion.cpp
-STABLEDIFFUSION_GGML_VERSION?=3b6c9ca97cfcda8e68e719e6670d06379fcbe943
+STABLEDIFFUSION_GGML_VERSION?=3590aa8d626e671a1b1dc84506ea2932a243a480

 CMAKE_ARGS+=-DGGML_MAX_NAME=128

--- a/backend/go/stablediffusion-ggml/cpp/gosd.cpp
+++ b/backend/go/stablediffusion-ggml/cpp/gosd.cpp
@@ -798,6 +798,7 @@ void sd_img_gen_params_set_seed(sd_img_gen_params_t *params, int64_t seed) {
 int gen_image(sd_img_gen_params_t *p, int steps, char *dst, float cfg_scale, char *src_image, float strength, char *mask_image, char* ref_images[], int ref_images_count) {

    sd_image_t* results;
+    int num_results_out = 0;

    std::vector<int> skip_layers = {7, 8, 9};

@@ -994,10 +995,14 @@ int gen_image(sd_img_gen_params_t *p, int steps, char *dst, float cfg_scale, cha
            sd_ctx_params_to_str(&ctx_params),
            sd_img_gen_params_to_str(p));

-    results = generate_image(sd_c, p);
+    bool gen_ok = generate_image(sd_c, p, &results, &num_results_out);

    std::free(p);

+    if (!gen_ok || num_results_out == 0) {
+        results = NULL;
+    }
+
    if (results == NULL) {
        fprintf (stderr, "NO results\n");
        if (input_image_buffer) free(input_image_buffer);
--- a/backend/go/whisper/Makefile
+++ b/backend/go/whisper/Makefile
@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)

 # whisper.cpp version
 WHISPER_REPO?=https://github.com/ggml-org/whisper.cpp
-WHISPER_CPP_VERSION?=0ae02cdb2c7317b50991367c165736ce42ed96ac
+WHISPER_CPP_VERSION?=6fc7c33b4c3a2cec83e4b65abd5e96a890480375
 SO_TARGET?=libgowhisper.so

 CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF
--- a/backend/python/vllm/backend.py
+++ b/backend/python/vllm/backend.py
@@ -748,7 +748,12 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
        # When (A) native streaming ran cleanly, per-delta yields above already
        # delivered everything — do NOT extract again on the full text or we'd
        # duplicate content/tool_calls into the final chunk.
-        if has_tool_parser and not (native_streaming and not native_streaming_error):
+        # NOTE: `native_streaming` is a capability flag ("streaming parser is
+        # available"), not a state flag ("streaming actually ran"). For
+        # non-streaming requests it is still True but the per-delta loop was
+        # never entered, so we MUST still run extract_tool_calls here. Hence
+        # the explicit `streaming and …` guard on both branches.
+        if has_tool_parser and not (streaming and native_streaming and not native_streaming_error):
            try:
                tp = tp_instance
                if tp is None:
@@ -770,7 +775,7 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
                        ))
            except Exception as e:
                print(f"Tool parser error: {e}", file=sys.stderr)
-        elif native_streaming and not native_streaming_error:
+        elif streaming and native_streaming and not native_streaming_error:
            # Per-delta path already emitted content + tool_calls; the final
            # chat_delta should carry only metadata (token counts, logprobs).
            content = ""
--- a/backend/python/vllm/install.sh
+++ b/backend/python/vllm/install.sh
@@ -104,7 +104,7 @@ if [ "$(uname -s)" = "Darwin" ]; then
    # can rewrite it. Darwin therefore follows vllm-metal and can lag the Linux
    # vllm pin (requirements-cublas13-after.txt, bumped independently against
    # vllm/vllm) until vllm-metal supports a newer vLLM.
-    VLLM_METAL_VERSION="v0.3.0.dev20260628073537"
+    VLLM_METAL_VERSION="v0.3.0.dev20260701132215"

    # The coupled vLLM source version is whatever this vllm-metal release builds
    # against -- it declares it in its own installer as `vllm_v=`. Derive it from
--- a/backend/python/vllm/requirements-cublas13-after.txt
+++ b/backend/python/vllm/requirements-cublas13-after.txt
@@ -3,8 +3,8 @@
 # on a cu130 host. Pull the cu130-flavoured wheel from vLLM's per-tag index
 # instead — the cublas13 case in install.sh adds --index-strategy=unsafe-best-match
 # so uv consults this index alongside PyPI.
--extra-index-url https://wheels.vllm.ai/0.23.0/cu130
+--extra-index-url https://wheels.vllm.ai/0.24.0/cu130
 # VERSION COUPLING: darwin/Apple-Silicon builds use vllm-metal (see install.sh),
 # which pins this exact vLLM version. Bumping vllm here means coordinating with a
 # vllm-metal release that supports the new version, or macOS/Metal builds break.
-vllm==0.23.0
+vllm==0.24.0
--- a/backend/rust/kokoros/src/service.rs
+++ b/backend/rust/kokoros/src/service.rs
@@ -351,6 +351,16 @@ impl Backend for KokorosService {
        Err(Status::unimplemented("Not supported"))
    }

+    type AudioTranscriptionLiveStream =
+        ReceiverStream<Result<backend::TranscriptLiveResponse, Status>>;
+
+    async fn audio_transcription_live(
+        &self,
+        _: Request<tonic::Streaming<backend::TranscriptLiveRequest>>,
+    ) -> Result<Response<Self::AudioTranscriptionLiveStream>, Status> {
+        Err(Status::unimplemented("Not supported"))
+    }
+
    async fn diarize(
        &self,
        _: Request<backend::DiarizeRequest>,
--- a/cmd/launcher/internal/launcher.go
+++ b/cmd/launcher/internal/launcher.go
@@ -207,12 +207,20 @@ func (l *Launcher) StartLocalAI() error {
 	}

 	// Build command arguments
+	dataPath := l.GetDataPath()
 	args := []string{
 		"run",
 		"--models-path", l.config.ModelsPath,
 		"--backends-path", l.config.BackendsPath,
 		"--address", l.config.Address,
 		"--log-level", l.config.LogLevel,
+		// Keep persistent data and dynamic config under the launcher's data
+		// directory (~/.localai) rather than letting the server resolve them
+		// to ${basepath}/{data,configuration}. ${basepath} expands to the
+		// launcher process's CWD (often the user's home root), which puts
+		// ~/data and ~/configuration outside ~/.localai. See #10610.
+		"--data-path", filepath.Join(dataPath, "data"),
+		"--localai-config-dir", filepath.Join(dataPath, "configuration"),
 	}

 	l.localaiCmd = exec.CommandContext(l.ctx, binaryPath, args...)
--- a/core/application/distributed.go
+++ b/core/application/distributed.go
@@ -356,6 +356,12 @@ func initDistributed(cfg *config.ApplicationConfig, authDB *gorm.DB, configLoade
 		PrefixConfig:     prefixCfg,
 		Pressure:         pressure,
 		SharedModels:     cfg.Distributed.SharedModels,
+		// Cap how long a cold load may hold the per-model advisory lock: the
+		// configured backend.install deadline plus a margin for file staging and
+		// the remote LoadModel. Derived from the install timeout so raising it
+		// (for slow links pulling multi-GB images) widens the ceiling too,
+		// instead of letting the static default cut a legitimately slow load.
+		ModelLoadCeiling: cfg.Distributed.BackendInstallTimeoutOrDefault() + 10*time.Minute,
 	})

 	// Wire staging-progress broadcasting so file-staging shows up on every
--- a/core/services/advisorylock/advisorylock.go
+++ b/core/services/advisorylock/advisorylock.go
@@ -6,10 +6,39 @@ import (
 	"hash/fnv"
 	"strings"
 	"sync"
+	"time"

 	"gorm.io/gorm"
 )

+// advisoryLockWaitBackstop bounds, server-side, how long we will wait to
+// acquire a blocking advisory lock when the caller's context carries no
+// deadline (e.g. a startup schema migration using context.Background()). It
+// only exists so such a caller cannot hang forever behind a holder whose
+// session never releases the lock; it is far longer than any legitimate
+// guarded section. A var (not const) so tests can shrink it.
+var advisoryLockWaitBackstop = 30 * time.Minute
+
+// advisoryLockTimeoutMargin is added to a context's remaining budget when
+// deriving the server-side lock_timeout, so the Go context's own (cleaner)
+// cancellation fires first and the server bound is only ever a backstop.
+const advisoryLockTimeoutMargin = 30 * time.Second
+
+// advisoryLockWaitBudget returns the server-side lock_timeout to use for a
+// blocking acquire: the caller context's remaining time plus a margin (so the
+// Go context still governs), or the backstop when the context has no deadline.
+// Never returns zero - "wait forever" must not be possible.
+func advisoryLockWaitBudget(ctx context.Context) time.Duration {
+	if dl, ok := ctx.Deadline(); ok {
+		budget := time.Until(dl) + advisoryLockTimeoutMargin
+		if budget < time.Second {
+			budget = time.Second
+		}
+		return budget
+	}
+	return advisoryLockWaitBackstop
+}
+
 // localLocks holds one buffered channel (capacity 1) per lock key, used as an
 // in-process mutex for non-PostgreSQL dialects (SQLite). A SQLite auth DB is
 // effectively single-process, so serializing guarded sections within this
@@ -130,6 +159,27 @@ func WithLockCtx(ctx context.Context, db *gorm.DB, key int64, fn func() error) e
 	}
 	defer conn.Close()

+	// Override any deployment-wide lock_timeout on this dedicated connection.
+	// Operators commonly set a short global lock_timeout (on the role or
+	// database) to bound ordinary row-lock waits. Applied to the blocking
+	// pg_advisory_lock below, it aborts the wait with SQLSTATE 55P03 and turns
+	// LocalAI's intentional cross-replica "wait your turn, then re-check"
+	// coordination into a hard error for the caller (e.g. a chat request that
+	// just wanted to reuse a model another replica is loading).
+	//
+	// We do NOT disable it outright (lock_timeout = 0 would wait forever, which
+	// is unsafe for the schema-migration callers that pass context.Background()).
+	// Instead we set a bound derived from the caller's context: its remaining
+	// budget plus a margin so the Go context's cancellation wins with a clean
+	// error, or a finite backstop when the context has no deadline.
+	waitBudget := advisoryLockWaitBudget(ctx)
+	if _, err := conn.ExecContext(ctx,
+		fmt.Sprintf("SET lock_timeout = %d", waitBudget.Milliseconds())); err != nil {
+		return fmt.Errorf("advisorylock: setting lock_timeout: %w", err)
+	}
+	// Restore the session default before this pooled connection is reused.
+	defer func() { _, _ = conn.ExecContext(context.Background(), "RESET lock_timeout") }()
+
 	if _, err := conn.ExecContext(ctx, "SELECT pg_advisory_lock($1)", key); err != nil {
 		return fmt.Errorf("advisorylock: acquiring lock %d: %w", key, err)
 	}
--- a/core/services/advisorylock/advisorylock_test.go
+++ b/core/services/advisorylock/advisorylock_test.go
@@ -158,6 +158,87 @@ var _ = Describe("AdvisoryLock", func() {
 			Expect(err).To(HaveOccurred())
 		})

+		It("waits out a short server-side lock_timeout instead of failing with 55P03", func() {
+			const lockKey int64 = 703
+
+			// Reproduce the production deployment that triggered this: a short
+			// global lock_timeout set on the database. Without the fix, a waiter
+			// blocked on pg_advisory_lock() is aborted by the server after this
+			// window and surfaces SQLSTATE 55P03 ("canceling statement due to
+			// lock timeout") to the caller instead of waiting for its turn.
+			Expect(db.Exec("ALTER DATABASE testdb SET lock_timeout = '300ms'").Error).ToNot(HaveOccurred())
+			sqlDB, err := db.DB()
+			Expect(err).ToNot(HaveOccurred())
+			// Drop pooled connections so subsequent ones reconnect and inherit
+			// the new database-level lock_timeout default.
+			sqlDB.SetMaxIdleConns(0)
+
+			holding := make(chan struct{})
+			released := make(chan struct{})
+			go func() {
+				defer GinkgoRecover()
+				herr := WithLockCtx(context.Background(), db, lockKey, func() error {
+					close(holding)
+					// Hold well past the 300ms server lock_timeout.
+					time.Sleep(1 * time.Second)
+					return nil
+				})
+				Expect(herr).ToNot(HaveOccurred())
+				close(released)
+			}()
+
+			<-holding // ensure the holder owns the lock before we contend
+
+			ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+			defer cancel()
+			executed := false
+			start := time.Now()
+			werr := WithLockCtx(ctx, db, lockKey, func() error {
+				executed = true
+				return nil
+			})
+			Expect(werr).ToNot(HaveOccurred(),
+				"waiter should wait out the in-progress hold, not fail with lock_timeout (55P03)")
+			Expect(executed).To(BeTrue())
+			Expect(time.Since(start)).To(BeNumerically(">=", 400*time.Millisecond),
+				"waiter should have actually waited for the holder to release")
+			<-released
+		})
+
+		It("bounds a deadline-less waiter with the backstop instead of waiting forever", func() {
+			const lockKey int64 = 704
+
+			// A caller with no context deadline (e.g. startup schema migration
+			// passing context.Background()) must not hang forever if the holder
+			// never releases. Shrink the backstop so the test is fast.
+			origBackstop := advisoryLockWaitBackstop
+			advisoryLockWaitBackstop = 500 * time.Millisecond
+			DeferCleanup(func() { advisoryLockWaitBackstop = origBackstop })
+
+			holding := make(chan struct{})
+			release := make(chan struct{})
+			go func() {
+				defer GinkgoRecover()
+				_ = WithLockCtx(context.Background(), db, lockKey, func() error {
+					close(holding)
+					<-release // hold until the test releases us
+					return nil
+				})
+			}()
+			defer close(release)
+
+			<-holding
+
+			start := time.Now()
+			err := WithLockCtx(context.Background(), db, lockKey, func() error {
+				Fail("waiter should not have acquired the still-held lock")
+				return nil
+			})
+			Expect(err).To(HaveOccurred(), "deadline-less waiter should give up at the backstop, not hang")
+			Expect(time.Since(start)).To(BeNumerically("<", 5*time.Second),
+				"backstop must cap the wait well under the test timeout")
+		})
+
 		It("serializes concurrent WithLockCtx on same key", func() {
 			const lockKey int64 = 702

--- a/core/services/nodes/router.go
+++ b/core/services/nodes/router.go
@@ -68,6 +68,13 @@ type SmartRouterOptions struct {
 	// the absolute model paths untouched so the worker loads them directly from
 	// the shared volume (#10556). See config.DistributedConfig.SharedModels.
 	SharedModels bool
+	// ModelLoadCeiling is the hard upper bound on how long a single cold-load
+	// attempt (node selection -> backend install -> file staging -> LoadModel)
+	// may run while holding the per-model advisory lock. It backstops every
+	// sub-step's own timeout so a wedged worker can never pin the lock - and
+	// every other replica's request for that model - indefinitely. Zero selects
+	// defaultModelLoadCeiling.
+	ModelLoadCeiling time.Duration
 }

 // SmartRouter routes inference requests to the best available backend node.
@@ -101,8 +108,18 @@ type SmartRouter struct {
 	// sharedModels skips file staging when all nodes mount the same models
 	// directory at the same path (see SmartRouterOptions.SharedModels).
 	sharedModels bool
+	// modelLoadCeiling bounds how long a cold load may hold the per-model
+	// advisory lock (see SmartRouterOptions.ModelLoadCeiling).
+	modelLoadCeiling time.Duration
 }

+// defaultModelLoadCeiling is the fallback hold ceiling for a cold model load.
+// It must comfortably exceed the slowest legitimate load - a multi-GB backend
+// install (DefaultBackendInstallTimeout, 15m) plus staging and the remote
+// LoadModel (5m) - so it never cuts a real load short; it only ever fires when
+// a step is genuinely wedged (e.g. a worker that died mid-install).
+const defaultModelLoadCeiling = 25 * time.Minute
+
 // probeCacheTTL is how long a successful gRPC HealthCheck on a backend is
 // trusted before the next request re-probes. Matches healthCheckTTL in
 // pkg/model/model.go so the single-process and distributed paths share a
@@ -117,6 +134,10 @@ func NewSmartRouter(registry ModelRouter, opts SmartRouterOptions) *SmartRouter
 	if factory == nil {
 		factory = &tokenClientFactory{token: opts.AuthToken}
 	}
+	ceiling := opts.ModelLoadCeiling
+	if ceiling <= 0 {
+		ceiling = defaultModelLoadCeiling
+	}
 	return &SmartRouter{
 		registry:         registry,
 		unloader:         opts.Unloader,
@@ -131,6 +152,7 @@ func NewSmartRouter(registry ModelRouter, opts SmartRouterOptions) *SmartRouter
 		prefixConfig:     opts.PrefixConfig,
 		pressure:         opts.Pressure,
 		sharedModels:     opts.SharedModels,
+		modelLoadCeiling: ceiling,
 	}
 }

@@ -383,11 +405,19 @@ func (r *SmartRouter) Route(ctx context.Context, modelID, modelName, backendType
 	// the request context. If staging were bound to it, the multi-GB upload
 	// aborts with "context canceled" mid-transfer and large models can never
 	// finish staging (the model-load outage). WithoutCancel keeps the request's
-	// values (prefix chain, etc.) but drops its cancellation/deadline. Each
-	// long step still has its own bound (the file stager's resume budget,
-	// LoadModel's 5m timeout), and the per-model advisory lock below de-dupes
-	// concurrent loaders across replicas.
-	loadCtx := context.WithoutCancel(ctx)
+	// values (prefix chain, etc.) but drops its cancellation/deadline.
+	//
+	// Detaching from the caller is necessary, but it must not be unbounded: the
+	// load runs while holding the per-model advisory lock, and a worker that
+	// dies mid-install (its backend.install never replies) would otherwise pin
+	// that lock (and every other replica's request for the same model) until
+	// the NATS install deadline alone expires. Re-impose a single hard ceiling
+	// over the whole sequence so the lock is always released in bounded time,
+	// even if a sub-step wedges. Each long step still has its own (tighter)
+	// bound; this only backstops them. The per-model advisory lock below
+	// de-dupes concurrent loaders across replicas.
+	loadCtx, cancelLoad := context.WithTimeout(context.WithoutCancel(ctx), r.modelLoadCeiling)
+	defer cancelLoad()
 	loadModel := func(ctx context.Context) (*RouteResult, error) {
 		// Re-check after acquiring lock — another request may have loaded it
 		node, nm, err := r.registry.FindAndLockNodeWithModel(ctx, trackingKey, candidateNodeIDs, pref)
@@ -916,7 +946,14 @@ func (r *SmartRouter) installBackendOnNode(ctx context.Context, node *BackendNod
 	}

 	key := fmt.Sprintf("%s|%s|%s|%d", node.ID, backendType, modelID, replicaIndex)
-	v, err, _ := r.installFlight.Do(key, func() (any, error) {
+	// DoChan rather than Do so this wait honors ctx cancellation. InstallBackend
+	// blocks for its full NATS deadline (15m by default) when a worker accepts
+	// the request but never replies (e.g. it died mid-install). Without ctx
+	// awareness the caller (holding the per-model advisory lock) would sit there
+	// the whole time; here a cancelled ctx (typically the model-load ceiling)
+	// frees the caller promptly. The shared install keeps running in the
+	// background and still coalesces other callers via singleflight.
+	resCh := r.installFlight.DoChan(key, func() (any, error) {
 		reply, err := r.unloader.InstallBackend(node.ID, backendType, modelID, r.galleriesJSON, "", "", "", replicaIndex, "", nil)
 		if err != nil {
 			return "", err
@@ -931,10 +968,15 @@ func (r *SmartRouter) installBackendOnNode(ctx context.Context, node *BackendNod
 		}
 		return addr, nil
 	})
-	if err != nil {
-		return "", err
+	select {
+	case <-ctx.Done():
+		return "", ctx.Err()
+	case res := <-resCh:
+		if res.Err != nil {
+			return "", res.Err
+		}
+		return res.Val.(string), nil
 	}
-	return v.(string), nil
 }

 func (r *SmartRouter) buildClientForAddr(node *BackendNode, addr string, parallel bool) grpc.Backend {
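
The two router changes above (a hard ceiling over a detached context, and a wait that honors cancellation) can be sketched without NATS or the router types. This is a simplified illustration, not the router code: `waitCtx`, `loadWithCeiling`, `runWedged`, and `runDetached` are hypothetical names, a plain buffered channel stands in for `singleflight.DoChan` (so there is no call coalescing here), and `context.WithoutCancel` assumes Go 1.21 or newer.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// result carries a value-or-error reply over a channel.
type result struct {
	val string
	err error
}

// waitCtx runs fn in the background and returns as soon as either fn
// finishes or ctx is done. The channel is buffered so fn's goroutine can
// still deliver its result and exit after the caller has given up.
func waitCtx(ctx context.Context, fn func() (string, error)) (string, error) {
	ch := make(chan result, 1)
	go func() {
		v, err := fn()
		ch <- result{v, err}
	}()
	select {
	case <-ctx.Done():
		return "", ctx.Err()
	case r := <-ch:
		return r.val, r.err
	}
}

// loadWithCeiling drops the request's cancellation but re-imposes a hard
// deadline, so a wedged step cannot hold the caller indefinitely.
func loadWithCeiling(reqCtx context.Context, ceiling time.Duration, install func() (string, error)) (string, error) {
	loadCtx, cancel := context.WithTimeout(context.WithoutCancel(reqCtx), ceiling)
	defer cancel()
	return waitCtx(loadCtx, install)
}

// runWedged simulates a worker that accepts the install but never replies.
func runWedged() error {
	block := make(chan struct{})
	defer close(block) // let the background goroutine drain
	_, err := loadWithCeiling(context.Background(), 100*time.Millisecond, func() (string, error) {
		<-block
		return "", nil
	})
	return err
}

// runDetached shows that an already-cancelled request does not abort a
// healthy load, because the load context is detached from it.
func runDetached() (string, error) {
	reqCtx, cancelReq := context.WithCancel(context.Background())
	cancelReq()
	return loadWithCeiling(reqCtx, time.Second, func() (string, error) {
		return "10.0.0.4:50051", nil
	})
}

func main() {
	fmt.Println(errors.Is(runWedged(), context.DeadlineExceeded))
	fmt.Println(runDetached())
}
```

As in the diff, giving up on the wait does not stop the background work: the install goroutine keeps running until it returns, and only the waiting caller is released.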
--- a/core/services/nodes/router_test.go
+++ b/core/services/nodes/router_test.go
@@ -493,6 +493,44 @@ var _ = Describe("SmartRouter", func() {
 				Expect(result.Node.ID).To(Equal("n3"))
 			})
 		})
+
+		Context("worker wedges mid-install (dead node holding the lock)", func() {
+			It("aborts the load at the ModelLoadCeiling instead of blocking forever", func() {
+				// Simulate the production incident: the chosen worker accepts the
+				// backend.install but never replies (it died), so InstallBackend
+				// would otherwise block for its full NATS deadline (15m by
+				// default) while pinning the per-model advisory lock. Route must
+				// give up at the ceiling so the lock is released promptly.
+				reg.findAndLockErr = errors.New("not found")
+				reg.findIdleNode = &BackendNode{ID: "n4", Name: "dead-node", Address: "10.0.0.4:50051"}
+
+				block := make(chan struct{})
+				defer close(block) // let the background install goroutine drain at test end
+				unloader.installHook = func() { <-block }
+
+				router := NewSmartRouter(reg, SmartRouterOptions{
+					Unloader:         unloader,
+					ClientFactory:    factory,
+					ModelLoadCeiling: 200 * time.Millisecond,
+				})
+
+				done := make(chan error, 1)
+				start := time.Now()
+				go func() {
+					defer GinkgoRecover()
+					_, err := router.Route(context.Background(), "wedged-model",
+						"models/wedged.gguf", "llama-cpp",
+						&pb.ModelOptions{Model: "models/wedged.gguf"}, false)
+					done <- err
+				}()
+
+				var routeErr error
+				Eventually(done, 5*time.Second).Should(Receive(&routeErr),
+					"Route must not block on a wedged install past the ceiling")
+				Expect(routeErr).To(HaveOccurred())
+				Expect(time.Since(start)).To(BeNumerically("<", 5*time.Second))
+			})
+		})
 	})

 	Describe("scheduleNewModel (mock-based, via Route)", func() {
--- a/docs/data/version.json
+++ b/docs/data/version.json
@@ -1,3 +1,3 @@
 {
-  "version": "v4.5.5"
+  "version": "v4.5.6"
 }
--- a/gallery/index.yaml
+++ b/gallery/index.yaml
@@ -1716,7 +1716,7 @@
      - use_jinja:true
    parameters:
      min_p: 0.15
-      model: llama-cpp/models/LFM2.5-8B-A1B-GGUF/LFM2.5-8B-A1B-Q4_K_M.gguf
+      model: llama-cpp/models/LFM2.5-8B-A1B-GGUF/LFM2.5-8B-A1B-Q8_0.gguf
      repeat_penalty: 1.05
      temperature: 0.1
      top_k: 50
@@ -1724,9 +1724,9 @@
    template:
      use_tokenizer_template: true
  files:
-    - filename: llama-cpp/models/LFM2.5-8B-A1B-GGUF/LFM2.5-8B-A1B-Q4_K_M.gguf
-      uri: https://huggingface.co/LiquidAI/LFM2.5-8B-A1B-GGUF/resolve/main/LFM2.5-8B-A1B-Q4_K_M.gguf
-      sha256: 4923ec14f06b968b74d663e5949867d2d9c3bf13a20b8be1a9f9af39989b2bb0
+    - filename: llama-cpp/models/LFM2.5-8B-A1B-GGUF/LFM2.5-8B-A1B-Q8_0.gguf
+      uri: https://huggingface.co/LiquidAI/LFM2.5-8B-A1B-GGUF/resolve/main/LFM2.5-8B-A1B-Q8_0.gguf
+      sha256: 33ab3b8ce6a964fb8ebac89360c9b3cf72c4fa418d5e4c0a94d46883124d5c02
 - name: "qwopus3.5-9b-coder-mtp"
  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
  urls:
@@ -1758,8 +1758,8 @@
      use_tokenizer_template: true
  files:
    - filename: llama-cpp/models/Qwopus3.5-9B-Coder-MTP-GGUF/Qwopus3.5-9B-Coder-MTP-Q4_K_M.gguf
-      sha256: f6fc5d193045796d9e1870cbc40f827fe55f53f70593c3f5c1968b82b9331991
      uri: https://huggingface.co/Jackrong/Qwopus3.5-9B-Coder-MTP-GGUF/resolve/main/Qwopus3.5-9B-Coder-MTP-Q4_K_M.gguf
+      sha256: 9ea3ecd122a5165b8b81655f29eaf09d71daf841503e4c4212bdfadb36ab3712
    - filename: llama-cpp/mmproj/Qwopus3.5-9B-Coder-MTP-GGUF/Qwopus3.5-9B-Coder-MTP-mmproj.gguf
      sha256: f48daca405a1c768a9514e392c3955dcc4a9d66a5cf64cf45e064092b5f20ee4
      uri: https://huggingface.co/Jackrong/Qwopus3.5-9B-Coder-MTP-GGUF/resolve/main/Qwopus3.5-9B-Coder-MTP-mmproj.gguf
--- a/pkg/grpc/grpcerrors/errors.go
+++ b/pkg/grpc/grpcerrors/errors.go
@@ -58,6 +58,23 @@ func IsLiveTranscriptionUnsupported(err error) bool {
 	return strings.Contains(strings.ToLower(err.Error()), "unimplemented")
 }

+// IsUnimplemented reports whether err is a gRPC Unimplemented status — the
+// signal a backend gives for an RPC it does not implement. The generated
+// UnimplementedBackendServer stub returns exactly this for any RPC a backend
+// (e.g. a Python or external backend) has not overridden, so callers can treat
+// an optional RPC as a no-op rather than a failure. Prefers the typed status
+// code and falls back to the message for paths that lose the status (e.g. errors
+// wrapped across non-gRPC boundaries).
+func IsUnimplemented(err error) bool {
+	if err == nil {
+		return false
+	}
+	if status.Code(err) == codes.Unimplemented {
+		return true
+	}
+	return strings.Contains(strings.ToLower(err.Error()), "unimplemented")
+}
+
 // StreamTranscriptionUnsupported returns the canonical error a backend returns
 // when it (or the loaded model) cannot serve the server-streaming
 // AudioTranscriptionStream RPC. It carries codes.Unimplemented like the live
--- a/pkg/grpc/grpcerrors/errors_test.go
+++ b/pkg/grpc/grpcerrors/errors_test.go
@@ -55,6 +55,18 @@ var _ = Describe("grpcerrors", func() {
 		Expect(grpcerrors.IsModelNotLoaded(err)).To(BeFalse())
 	})

+	DescribeTable("IsUnimplemented",
+		func(err error, want bool) {
+			Expect(grpcerrors.IsUnimplemented(err)).To(Equal(want))
+		},
+		Entry("nil", nil, false),
+		Entry("typed code", status.Error(codes.Unimplemented, "method Free not implemented"), true),
+		Entry("stale stub message (Unknown code)", errors.New("rpc error: code = Unimplemented desc = "), true),
+		Entry("unrelated error", errors.New("context deadline exceeded"), false),
+		Entry("unrelated grpc code", status.Error(codes.Unavailable, "connection refused"), false),
+		Entry("model not loaded is NOT unimplemented", grpcerrors.ModelNotLoaded("parakeet-cpp"), false),
+	)
+
 	It("StreamTranscriptionUnsupported carries Unimplemented and is not ModelNotLoaded", func() {
 		err := grpcerrors.StreamTranscriptionUnsupported("parakeet-cpp", "not a streaming model")
 		Expect(status.Code(err)).To(Equal(codes.Unimplemented))
--- a/pkg/model/process.go
+++ b/pkg/model/process.go
@@ -11,6 +11,7 @@ import (
 	"time"

 	"github.com/hpcloud/tail"
+	"github.com/mudler/LocalAI/pkg/grpc/grpcerrors"
 	"github.com/mudler/LocalAI/pkg/signals"
 	process "github.com/mudler/go-processmanager"
 	"github.com/mudler/xlog"
@@ -52,10 +53,21 @@ func (ml *ModelLoader) deleteProcess(s string) error {
 		hook(s)
 	}

-	// Free GPU resources before stopping the process to ensure VRAM is released
+	// Free GPU resources before stopping the process to ensure VRAM is released.
+	// Free is optional: backends that don't override it (the generated stub, many
+	// Python/external backends, or a federation proxy in distributed mode) return
+	// gRPC Unimplemented. That is expected, not a failure — VRAM is reclaimed when
+	// the process is stopped below, or by the remote unloader for remote backends —
+	// so don't surface it as an error.
 	xlog.Debug("Calling Free() to release GPU resources", "model", s)
 	if err := model.GRPC(false, ml.wd).Free(context.Background()); err != nil {
-		xlog.Warn("Error freeing GPU resources", "error", err, "model", s)
+		if grpcerrors.IsUnimplemented(err) {
+			xlog.Debug("Backend does not implement Free(); GPU release handled on process stop", "model", s)
+		} else {
+			// Now that the expected Unimplemented case is filtered out above, a
+			// remaining error is a genuine failure to release VRAM — surface it.
+			xlog.Error("Error freeing GPU resources", "error", err, "model", s)
+		}
 	}

 	process := model.Process()
Author	SHA1	Message	Date
LocalAI [bot]	29001a88c1	fix(distributed): don't let a dead worker pin the model-load advisory lock (#10600 ) * fix(distributed): don't let a dead worker pin the model-load advisory lock In distributed mode a chat request could fail with: failed to route model with internal loader: routing model ...: loading model ...: advisorylock: acquiring lock <id>: ERROR: canceling statement due to lock timeout (SQLSTATE 55P03) Root cause is two independent defects in the cross-replica model-load path: 1. SmartRouter.Route holds a per-model PostgreSQL advisory lock for the whole cold-load sequence, which includes installBackendOnNode -> InstallBackend, a NATS request-reply with a 15m deadline (DefaultBackendInstallTimeout) that ignored ctx. When the chosen worker died mid-install, the holder sat on the lock for up to 15m. The detached loadCtx (WithoutCancel) had no deadline, so nothing capped the hold. 2. The acquiring statement, pg_advisory_lock(), is subject to any deployment global lock_timeout. A common operator setting (e.g. 10s) aborts the wait with SQLSTATE 55P03, so every other replica's request for that model hard-errored instead of waiting for the in-progress load and reusing it. For the ~15m window the model was effectively unroutable. Fixes: - advisorylock.WithLockCtx (postgres): SET lock_timeout = 0 on its dedicated connection (RESET before it returns to the pool) so the Go context, not a deployment-wide GUC, governs how long we wait. Waiters now block and then re-check, reusing the model another replica just loaded. - SmartRouter: bound the detached loadCtx with a single ModelLoadCeiling so the lock is always released in bounded time even if a sub-step wedges. Default is the configured backend.install deadline + 10m (staging + LoadModel margin), so a legitimately slow load is never cut. 
- installBackendOnNode: use singleflight.DoChan + select on ctx.Done() so the install wait honors cancellation; the ceiling can then actually free a caller pinned behind a dead worker. The shared install still coalesces via singleflight. Reproduced both defects as failing tests first (a real 55P03 against a testcontainer with a short lock_timeout; a wedged install that blocks Route) and confirmed green. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * fix(distributed): bound advisory-lock wait instead of disabling lock_timeout Setting lock_timeout = 0 to override a deployment's short global lock_timeout meant "wait forever" server-side. Safe for SmartRouter.Route (its loadCtx now carries the model-load ceiling) but unsafe for the schema-migration callers that pass context.Background(): a holder whose session never releases would hang them indefinitely. Derive the server-side lock_timeout from the caller's context instead: its remaining budget plus a margin (so the Go context's cancellation still wins with a clean error and the server bound is only a backstop), or a finite 30m backstop when the context has no deadline. Never zero - "wait forever" is no longer possible, while a deployment's hostile short lock_timeout is still overridden so legitimate cross-replica waits don't fail with 55P03. Added a spec proving a deadline-less waiter gives up at the (shrunk) backstop rather than hanging. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2026-07-02 09:52:51 +02:00
LocalAI [bot]	b0bfa0852e	chore: ⬆️ Update CrispStrobe/CrispASR to `fcbc8718e654995e3bd2d0c98bcb8e55e297d23c` (#10634 ) ⬆️ Update CrispStrobe/CrispASR Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-07-02 09:48:20 +02:00
LocalAI [bot]	39a93e91cf	chore: ⬆️ Update vllm-metal (darwin) to `v0.3.0.dev20260701132215` (#10633 ) ⬆️ Update vllm-project/vllm-metal (darwin) Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-07-02 09:48:08 +02:00
LocalAI [bot]	26e0c98967	chore: ⬆️ Update leejet/stable-diffusion.cpp to `3590aa8d626e671a1b1dc84506ea2932a243a480` (#10631 ) ⬆️ Update leejet/stable-diffusion.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-07-02 09:47:54 +02:00
LocalAI [bot]	9acca54b25	chore: ⬆️ Update mudler/parakeet.cpp to `e8acc6172a94e20a952cf1843decace5d771a94b` (#10629 ) ⬆️ Update mudler/parakeet.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-07-02 09:47:41 +02:00
LocalAI [bot]	2728e6000e	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `068b173649f2fd8dc96b35ada5a0b76d8985105d` (#10632 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-07-02 09:47:28 +02:00
LocalAI [bot]	006310d746	chore: ⬆️ Update ggml-org/llama.cpp to `4fc4ec5541b243957ae5099edb67372f8f3b550e` (#10630 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-07-02 09:47:15 +02:00
LocalAI [bot]	05acdb1778	chore: ⬆️ Update ggml-org/whisper.cpp to `6fc7c33b4c3a2cec83e4b65abd5e96a890480375` (#10635 ) ⬆️ Update ggml-org/whisper.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-07-02 09:47:01 +02:00
LocalAI [bot]	5e68b5700c	chore(model-gallery): ⬆️ update checksum (#10637 ) ⬆️ Checksum updates in gallery/index.yaml Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-07-02 09:26:32 +02:00
pos-ei-don	7910018249	fix(vllm): non-streaming tool-call regression after #10351 (#10638 ) fix(vllm): non-streaming tool-call regression after #10351 (native_streaming is a capability flag, not a state flag) #10351 introduced native streaming via `parser.extract_tool_calls_streaming` and gated the post-loop `extract_tool_calls` block on `native_streaming and not native_streaming_error`. That works for streaming requests, but for non-streaming requests the same flag is still True (it only means "the parser can stream", not "we actually streamed"), so the block was skipped and the `elif` cleared `content = ""` — the tool call was silently lost. Symptom: non-streaming chat.completions with `tools=[...]` returns `finish_reason: "stop"` with `content: ""` and no `tool_calls`. Streaming requests are unaffected. Fix: gate both branches on `streaming` too, so the extract_tool_calls block runs for non-streaming requests (and for streaming requests that fell back to the buffered path). Reproduction (vLLM 0.24, Qwen3-Coder-Next-NVFP4, qwen3_coder parser): curl -s -X POST http://localhost:8080/v1/chat/completions \ -H 'Content-Type: application/json' \ -d '{"model":"coder","stream":false, "messages":[{"role":"user","content":"7*8 via calc"}], "tools":[{"type":"function","function":{"name":"calc", "parameters":{"type":"object", "properties":{"expression":{"type":"string"}}}}}]}' Before: finish_reason: "stop", content: "", tool_calls: [] After: finish_reason: "tool_calls", tool_calls[0].function.name: "calc" Streaming path re-verified in the same setup: delta.tool_calls arrives token-by-token, finish_reason: "tool_calls", no raw XML in content. Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>	2026-07-02 09:26:14 +02:00
LocalAI [bot]	1a03712a6f	fix(hipblas): symlink amdgpu.ids so ROCm backends find the ASIC ID table (#10627 ) * fix(hipblas): symlink amdgpu.ids so ROCm backends find the ASIC ID table ROCm's bundled libdrm_amdgpu looks up the GPU ASIC ID table at a hardcoded fallback path, /opt/amdgpu/share/libdrm/amdgpu.ids, which is only populated by AMD's full amdgpu-install (graphics/DKMS) stack. The hipblas image is compute-only and doesn't have it, so every model load logs "No such file or directory" and the GPU can't be identified. Symlink it to the equivalent file already shipped by Ubuntu's libdrm-amdgpu1 package. Fixes #10624 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(hipblas): correct amdgpu.ids source package name in comment Verified against the real rocm/dev-ubuntu-24.04:7.2.1 image with hipblas-dev/hipblaslt-dev/rocblas-dev installed: /usr/share/libdrm/amdgpu.ids is owned by libdrm-common, not libdrm-amdgpu1 as the comment said. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-07-02 09:25:14 +02:00
LocalAI [bot]	703ea32de6	chore: ⬆️ Update vllm-metal (darwin) to `v0.3.0.dev20260630095652` (#10616 ) ⬆️ Update vllm-project/vllm-metal (darwin) Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-07-01 21:56:59 +02:00
LocalAI [bot]	751db06e35	chore: ⬆️ Update CrispStrobe/CrispASR to `8fd9db8fec8cb5e929d23d3267ed5817794feb1a` (#10615 ) ⬆️ Update CrispStrobe/CrispASR Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-07-01 21:56:41 +02:00
LocalAI [bot]	f46c0e9c83	docs: ⬆️ update docs version mudler/LocalAI (#10614 ) ⬆️ Update docs version mudler/LocalAI Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-07-01 21:56:21 +02:00
LocalAI [bot]	0d8adfc59a	chore: ⬆️ Update ggml-org/llama.cpp to `0eca4d490e591d4e93058d07540cf47278a72577` (#10617 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-07-01 09:31:50 +02:00
LocalAI [bot]	43f2615e19	chore: ⬆️ Update vllm-project/vllm cu130 wheel to `0.24.0` (#10618 ) ⬆️ Update vllm-project/vllm cu130 wheel Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-07-01 08:53:03 +02:00
LocalAI [bot]	875c539ad5	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `29431b31c89e79c10f8736e8f2742485ba1713d6` (#10620 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-07-01 08:52:36 +02:00
LocalAI [bot]	d641ded194	chore: ⬆️ Update ggml-org/whisper.cpp to `0874de3e8e8e48361dba85c7fe6d176f008bf158` (#10621 ) ⬆️ Update ggml-org/whisper.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-07-01 08:43:40 +02:00
LocalAI [bot]	40445fff05	chore: ⬆️ Update leejet/stable-diffusion.cpp to `484baa41e5e006c52dcd4addc38c830b9489745f` (#10619 ) * ⬆️ Update leejet/stable-diffusion.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * fix(stablediffusion-ggml): adapt to new generate_image() out-param signature leejet/stable-diffusion.cpp@484baa4 changed generate_image() from returning sd_image_t* to returning bool with images_out/num_images_out out-parameters (same pattern already used by generate_video()). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-07-01 08:32:57 +02:00
Tai An	057dee956a	fix(launcher): keep data/config under ~/.localai (#10610 ) (#10613 ) The launcher starts the server with run --models-path/--backends-path but leaves --data-path and the dynamic config dir unset, so the server falls back to its /data and /configuration defaults. is kong.ExpandPath("."), i.e. the launcher process CWD (commonly the user's home root), producing ~/data and ~/configuration outside ~/.localai and an agent-pool stateDir under ~/data. Pass --data-path and --localai-config-dir explicitly, rooted at the launcher's own data directory (GetDataPath() -> ~/.localai), so data and config stay consistent with --models-path/--backends-path.	2026-06-30 22:14:59 +02:00
Adira	4ec39bb776	fix(watchdog): don't log optional Free() as an error when backend returns Unimplemented (#10602 ) (#10607 ) * fix(watchdog): don't log optional Free() as an error when backend returns Unimplemented (#10602) When the watchdog evicts a model, deleteProcess calls the backend's gRPC Free() to release VRAM before stopping the process. Free is optional: backends that don't override it -- the generated UnimplementedBackendServer stub, many Python/external backends, or a federation proxy in distributed mode -- return gRPC Unimplemented. That is expected, not a failure: VRAM is reclaimed when the local process is stopped, or by the remote unloader for remote backends. Logging it as "WARN Error freeing GPU resources" made a benign, optional RPC look like a fault (the alarming line in #10602, seen in distributed mode where the model is remote and Free hits a stub). Treat gRPC Unimplemented from Free() as a no-op logged at Debug; genuine failures still Warn. Free() is still attempted for every backend, so any backend that does implement it is unaffected. Add a reusable grpcerrors.IsUnimplemented helper following the package's existing code-based detection idiom (prefer the typed status code, fall back to the message across non-gRPC boundaries), with table tests. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com> * fix(watchdog): log a non-Unimplemented Free() failure at error level Per review: now that the expected gRPC Unimplemented case is split out and logged at Debug, any remaining Free() error is a genuine failure to release VRAM, so surface it at error level instead of warn. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com> --------- Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>	2026-06-30 22:14:01 +02:00
Ettore Di Giacinto	25ecb9f015	fix(gallery): use Q8_0 for lfm2.5-8b-a1b to fix poor tool-call quality The Q4_K_M quant degraded tool-call reliability for LFM2.5-8B-A1B. Switch the gallery entry to the Q8_0 GGUF (sha256 verified via HF x-linked-etag) while keeping the native jinja tool-parsing config. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code]	2026-06-30 17:46:20 +00:00
LocalAI [bot]	2be495f9c0	fix(kokoros): implement AudioTranscriptionLive trait stub (#10612 ) The backend.proto AudioTranscriptionLive bidirectional streaming RPC added new required trait items (AudioTranscriptionLiveStream + audio_transcription_live) on the generated Backend trait. The kokoros (TTS) backend did not implement them, breaking its release build with E0046 (missing trait items). kokoros is text-to-speech and has no live-ASR support, so stub the method to return UNIMPLEMENTED, mirroring the existing audio_transcription_stream stub. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-30 19:38:41 +02:00